Reassign excess diarized speakers by embedding similarity #6
Open
ComputelessComputer wants to merge 1 commit into main from
When the user requests N speakers but Sortformer returns more, the old constrainer reassigned excess segments to the temporally nearest retained speaker. That ignores voice identity: a short interjection by Alice gets merged into Bob just because Bob was speaking around the same time.

The new path loads WeSpeaker (already used by the speaker embedding pipeline) lazily in DiarizationPipeline, builds a centroid embedding per retained speaker, embeds each excess segment, and assigns it to the retained centroid with the highest cosine similarity. Segments shorter than 1 s fall back to the old temporal heuristic because WeSpeaker is unreliable on very short clips.

The new path only engages when the user set a specific speaker count AND Sortformer's raw output exceeded it. Automatic mode, under-count cases, and short-audio edge cases all take the unchanged fast path.
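The reassignment described above can be sketched in Swift roughly as follows. This is a minimal illustration under stated assumptions, not the code in speech_bridge.swift: the `Segment` struct, `reassignExcess`, `timeGap`, and the precomputed `embedding` field are all hypothetical names, and the embeddings are toy 2-D vectors standing in for whatever WeSpeaker would produce. Only the per-speaker centroid, the max-cosine rule, and the 1 s fallback threshold come from the description.

```swift
import Foundation

// Hypothetical stand-in for a diarized segment (not the PR's real type).
struct Segment {
    let speaker: Int
    let start: Double        // seconds
    let end: Double          // seconds
    let embedding: [Float]   // assumed precomputed; the PR embeds with WeSpeaker
    var duration: Double { end - start }
}

func l2Normalized(_ v: [Float]) -> [Float] {
    let norm = v.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    return norm > 0 ? v.map { $0 / norm } : v
}

func cosine(_ a: [Float], _ b: [Float]) -> Float {
    let x = l2Normalized(a), y = l2Normalized(b)
    return zip(x, y).reduce(Float(0)) { $0 + $1.0 * $1.1 }
}

// Mean of unit-length embeddings, renormalized to unit length.
func centroid(of embeddings: [[Float]]) -> [Float] {
    guard let first = embeddings.first else { return [] }
    var sum = [Float](repeating: 0, count: first.count)
    for e in embeddings {
        let n = l2Normalized(e)
        for i in sum.indices { sum[i] += n[i] }
    }
    return l2Normalized(sum)
}

// Silence between two segments; 0 when they touch or overlap.
func timeGap(_ a: Segment, _ b: Segment) -> Double {
    max(0, max(a.start, b.start) - min(a.end, b.end))
}

func reassignExcess(_ segments: [Segment],
                    retained: Set<Int>,
                    minDuration: Double = 1.0) -> [Segment] {
    let kept = segments.filter { retained.contains($0.speaker) }
    let centroids = Dictionary(grouping: kept, by: { $0.speaker })
        .mapValues { group in centroid(of: group.map { $0.embedding }) }

    return segments.map { seg -> Segment in
        if retained.contains(seg.speaker) { return seg }
        let target: Int?
        if seg.duration < minDuration {
            // Too short to embed reliably: nearest retained segment in time.
            target = kept.min(by: { timeGap($0, seg) < timeGap($1, seg) })?.speaker
        } else {
            // Long enough: retained centroid with the highest cosine similarity.
            target = centroids.max(by: {
                cosine($0.value, seg.embedding) < cosine($1.value, seg.embedding)
            })?.key
        }
        return Segment(speaker: target ?? seg.speaker,
                       start: seg.start, end: seg.end, embedding: seg.embedding)
    }
}

// Speaker 2 is the excess cluster. Both of its segments sound like speaker 0
// but sit right next to speaker 1 in time.
let demo = [
    Segment(speaker: 0, start: 0.0,  end: 5.0,  embedding: [1, 0]),
    Segment(speaker: 1, start: 10.0, end: 15.0, embedding: [0, 1]),
    Segment(speaker: 2, start: 15.2, end: 17.2, embedding: [0.9, 0.1]), // 2.0 s
    Segment(speaker: 2, start: 9.5,  end: 9.9,  embedding: [0.9, 0.1]), // 0.4 s
]
let result = reassignExcess(demo, retained: [0, 1])
print(result.map { $0.speaker })  // prints [0, 1, 0, 1]
```

In this toy input the 2 s segment goes to speaker 0 by voice even though speaker 1 is adjacent, while the 0.4 s segment is below the threshold and still follows the temporal rule.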
Summary
When the user specifies a speaker count and Sortformer returns more clusters than requested, we now reassign segments from excess clusters using speaker embedding similarity instead of temporal adjacency. Only engages in the "Sortformer returned too many speakers" case; all other paths are unchanged.
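As a rough sketch of that gating condition, the decision could look like the snippet below. The `ConstraintPath` enum and `choosePath` function are hypothetical names introduced for illustration only, and the short-audio edge cases mentioned in the commit message are not modeled because the PR does not spell out their condition.

```swift
// Hypothetical names; the real dispatch lives in speech_bridge.swift.
enum ConstraintPath {
    case original               // existing constrainDiarizedSegments behaviour
    case embeddingReassignment  // new embedding-similarity path
}

func choosePath(requestedSpeakers: Int?, rawSpeakerIDs: Set<Int>) -> ConstraintPath {
    // Automatic mode: no specific speaker count was requested.
    guard let requested = requestedSpeakers, requested > 0 else { return .original }
    // Engage only when the raw output has more speakers than requested.
    return rawSpeakerIDs.count > requested ? .embeddingReassignment : .original
}
```

Keeping the check in one small function makes it easy to see that automatic mode and under-count cases never touch the embedding model.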
The bug
constrainDiarizedSegments collapses excess clusters by picking each excess segment's nearest retained segment by time gap (diarizedSegmentDistance). That merges a short interjection by one speaker into whichever retained speaker happened to be talking around the same time, a heuristic that ignores acoustic evidence.
What changed
src-tauri/swift-permissions/src/speech_bridge.swift: adds constrainDiarizedSegmentsUsingEmbeddings alongside the existing function.
DiarizationPipeline lazily loads WeSpeakerModel (the same model the speaker-embedding pipeline already loads, so there is no new download on machines that have already used the speaker library).
Why it helps
Short interjections from a third speaker were reliably being merged into whichever retained speaker was adjacent. With embeddings they now go to the voice they actually match.
What's NOT in this PR
selectSpeakerEmbeddingSegments.
normalizedEmbeddingCentroid: shipped in a separate PR.
Testing notes
Swift-only change; bun run build for the frontend still passes. Automatic mode (no requested speaker count) takes the unchanged fast path. The embedding path only runs when Sortformer's raw output has more speakers than requested, which is the concrete regression scenario; all other flows hit the original constrainDiarizedSegments.
Recommend validating against a small set of known meetings where a short-turn speaker used to get swallowed into the adjacent speaker, since that is the failure mode this fixes.
Addresses #4.