Reassign excess diarized speakers by embedding similarity#6

Open
ComputelessComputer wants to merge 1 commit into main from fix/embedding-based-speaker-reassignment

Conversation

@ComputelessComputer
Collaborator

Summary

When the user specifies a speaker count and Sortformer returns more clusters than requested, we now reassign segments from excess clusters using speaker embedding similarity instead of temporal adjacency. Only engages in the "Sortformer returned too many speakers" case; all other paths are unchanged.

The bug

constrainDiarizedSegments collapses excess clusters by picking each excess segment's nearest retained segment by time gap (diarizedSegmentDistance). That merges a short interjection by one speaker into whichever retained speaker happened to be talking around the same time — a heuristic that ignores acoustic evidence.
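For context, the old behavior can be sketched roughly as follows (the Segment struct and function bodies are illustrative, not the actual bridge source):

```swift
struct Segment {
    var speaker: Int
    var start: Double   // seconds
    var end: Double
}

// Time gap between two segments in seconds; 0 when they overlap.
func diarizedSegmentDistance(_ a: Segment, _ b: Segment) -> Double {
    max(0, max(a.start, b.start) - min(a.end, b.end))
}

// Old heuristic: each excess segment adopts the speaker of the
// temporally nearest retained segment, ignoring voice identity.
func reassignByTime(excess: [Segment], retained: [Segment]) -> [Segment] {
    excess.map { seg in
        var out = seg
        if let nearest = retained.min(by: {
            diarizedSegmentDistance(seg, $0) < diarizedSegmentDistance(seg, $1)
        }) {
            out.speaker = nearest.speaker
        }
        return out
    }
}
```

A short interjection landing between two retained turns always inherits whichever speaker is closest in time, regardless of whose voice it is.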

What changed

src-tauri/swift-permissions/src/speech_bridge.swift:

  • New constrainDiarizedSegmentsUsingEmbeddings alongside the existing function.
  • DiarizationPipeline lazily loads WeSpeakerModel (same model the speaker-embedding pipeline already loads, so no new download on machines that have already used the speaker library).
  • Reassignment path runs only when the user provided a speaker count AND Sortformer returned more clusters than that.
  • For each retained cluster, compute a centroid embedding. For each excess segment ≥1 s, embed and assign to the cluster with max cosine similarity. Segments shorter than 1 s or unembeddable fall back to the previous temporal heuristic so nothing gets dropped.
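The centroid-plus-cosine assignment above can be sketched like this (a minimal illustration; names and signatures are assumptions, not the actual bridge code — note the centroid normalizes the summed mean, since per-sample L2 normalization ships in a separate PR):

```swift
// Cosine similarity between two embedding vectors of equal length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let na = (a.reduce(0) { $0 + $1 * $1 }).squareRoot()
    let nb = (b.reduce(0) { $0 + $1 * $1 }).squareRoot()
    guard na > 0, nb > 0 else { return 0 }
    return dot / (na * nb)
}

// L2-normalized sum of a retained cluster's segment embeddings.
func normalizedEmbeddingCentroid(_ embeddings: [[Float]]) -> [Float] {
    guard let dim = embeddings.first?.count else { return [] }
    var sum = [Float](repeating: 0, count: dim)
    for e in embeddings {
        for i in 0..<dim { sum[i] += e[i] }
    }
    let norm = (sum.reduce(0) { $0 + $1 * $1 }).squareRoot()
    return norm > 0 ? sum.map { $0 / norm } : sum
}

// Pick the retained cluster whose centroid best matches a segment embedding.
func bestCluster(for embedding: [Float], centroids: [Int: [Float]]) -> Int? {
    centroids.max {
        cosineSimilarity(embedding, $0.value) < cosineSimilarity(embedding, $1.value)
    }?.key
}
```

Segments under 1 s (or that fail to embed) skip bestCluster entirely and fall through to the temporal heuristic, so nothing is dropped.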

Why it helps

Short interjections from a third speaker were reliably being merged into whichever retained speaker was adjacent. With embeddings they now go to the voice they actually match.

What's NOT in this PR

  • VAD pre-pass before Sortformer (needs a product decision about an extra model download).
  • Stratified embedding sampling in selectSpeakerEmbeddingSegments.
  • Post-merge of adjacent same-speaker turns.
  • L2-normalizing per sample inside normalizedEmbeddingCentroid — shipped in a separate PR.

Testing notes

Swift-only change; bun run build for the frontend still passes. Automatic mode (no requested speaker count) takes the unchanged fast path. The embedding path only runs when Sortformer's raw output has more speakers than requested, which is the concrete regression scenario; all other flows hit the original constrainDiarizedSegments.

Recommend validating against a small set of known meetings where a short-turn speaker used to get swallowed into the adjacent speaker — that's the failure mode this fixes.

Addresses #4.

When the user requests N speakers but Sortformer returns more, the old constrainer reassigned excess segments to the temporally-nearest retained speaker. That ignores voice identity — a short interjection by Alice gets merged into Bob just because Bob was speaking around the same time.

The new path loads WeSpeaker (already used by the speaker embedding pipeline) lazily in DiarizationPipeline, builds a centroid embedding per retained speaker, embeds each excess segment, and assigns it to the max-cosine-similarity retained centroid. Segments shorter than 1 s fall back to the old temporal heuristic because WeSpeaker is unreliable on very short clips.

Only engages when the user set a specific speaker count AND Sortformer's raw output exceeded it. Automatic mode, under-count cases, and short-audio edge cases all take the unchanged fast path.
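That gate amounts to a single guard, roughly (illustrative names, not the actual source):

```swift
// Embedding reassignment engages only for the over-count case;
// automatic mode (no requested count) keeps the fast path.
func shouldReassignWithEmbeddings(requestedSpeakers: Int?, detectedClusters: Int) -> Bool {
    guard let requested = requestedSpeakers else { return false }
    return detectedClusters > requested
}
```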
