Reassign excess diarized speakers by embedding similarity#6

Open
ComputelessComputer wants to merge 1 commit into main from fix/embedding-based-speaker-reassignment

Conversation

@ComputelessComputer
Collaborator

Summary

When the user specifies a speaker count and Sortformer returns more clusters than requested, we now reassign segments from excess clusters using speaker embedding similarity instead of temporal adjacency. Only engages in the "Sortformer returned too many speakers" case; all other paths are unchanged.

The bug

constrainDiarizedSegments collapses excess clusters by picking each excess segment's nearest retained segment by time gap (diarizedSegmentDistance). That merges a short interjection by one speaker into whichever retained speaker happened to be talking around the same time — a heuristic that ignores acoustic evidence.
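For context, the old behavior can be sketched roughly as follows (the Segment struct and function bodies are illustrative, not the actual bridge source):

```swift
struct Segment {
    var speaker: Int
    var start: Double   // seconds
    var end: Double
}

// Time gap between two segments in seconds; 0 when they overlap.
func diarizedSegmentDistance(_ a: Segment, _ b: Segment) -> Double {
    max(0, max(a.start, b.start) - min(a.end, b.end))
}

// Old heuristic: each excess segment adopts the speaker of the
// temporally nearest retained segment, ignoring voice identity.
func reassignByTime(excess: [Segment], retained: [Segment]) -> [Segment] {
    excess.map { seg in
        var out = seg
        if let nearest = retained.min(by: {
            diarizedSegmentDistance(seg, $0) < diarizedSegmentDistance(seg, $1)
        }) {
            out.speaker = nearest.speaker
        }
        return out
    }
}
```

A short interjection landing between two retained turns always inherits whichever speaker is closest in time, regardless of whose voice it is.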

What changed

src-tauri/swift-permissions/src/speech_bridge.swift:

  • New constrainDiarizedSegmentsUsingEmbeddings alongside the existing function.
  • DiarizationPipeline lazily loads WeSpeakerModel (same model the speaker-embedding pipeline already loads, so no new download on machines that have already used the speaker library).
  • Reassignment path runs only when the user provided a speaker count AND Sortformer returned more clusters than that.
  • For each retained cluster, compute a centroid embedding. For each excess segment ≥1 s, embed and assign to the cluster with max cosine similarity. Segments shorter than 1 s or unembeddable fall back to the previous temporal heuristic so nothing gets dropped.
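The centroid-plus-cosine assignment above can be sketched like this (a minimal illustration; names and signatures are assumptions, not the actual bridge code — note the centroid normalizes the summed mean, since per-sample L2 normalization ships in a separate PR):

```swift
// Cosine similarity between two embedding vectors of equal length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let na = (a.reduce(0) { $0 + $1 * $1 }).squareRoot()
    let nb = (b.reduce(0) { $0 + $1 * $1 }).squareRoot()
    guard na > 0, nb > 0 else { return 0 }
    return dot / (na * nb)
}

// L2-normalized sum of a retained cluster's segment embeddings.
func normalizedEmbeddingCentroid(_ embeddings: [[Float]]) -> [Float] {
    guard let dim = embeddings.first?.count else { return [] }
    var sum = [Float](repeating: 0, count: dim)
    for e in embeddings {
        for i in 0..<dim { sum[i] += e[i] }
    }
    let norm = (sum.reduce(0) { $0 + $1 * $1 }).squareRoot()
    return norm > 0 ? sum.map { $0 / norm } : sum
}

// Pick the retained cluster whose centroid best matches a segment embedding.
func bestCluster(for embedding: [Float], centroids: [Int: [Float]]) -> Int? {
    centroids.max {
        cosineSimilarity(embedding, $0.value) < cosineSimilarity(embedding, $1.value)
    }?.key
}
```

Segments under 1 s (or that fail to embed) skip bestCluster entirely and fall through to the temporal heuristic, so nothing is dropped.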

Why it helps

Short interjections from a third speaker were reliably being merged into whichever retained speaker was adjacent. With embeddings they now go to the voice they actually match.

What's NOT in this PR

  • VAD pre-pass before Sortformer (needs a product decision about an extra model download).
  • Stratified embedding sampling in selectSpeakerEmbeddingSegments.
  • Post-merge of adjacent same-speaker turns.
  • L2-normalizing per sample inside normalizedEmbeddingCentroid — shipped in a separate PR.

Testing notes

Swift-only change; bun run build for the frontend still passes. Automatic mode (no requested speaker count) takes the unchanged fast path. The embedding path only runs when Sortformer's raw output has more speakers than requested, which is the concrete regression scenario; all other flows hit the original constrainDiarizedSegments.

Recommend validating against a small set of known meetings where a short-turn speaker used to get swallowed into the adjacent speaker — that's the failure mode this fixes.

Addresses #4.

When the user requests N speakers but Sortformer returns more, the old constrainer reassigned excess segments to the temporally-nearest retained speaker. That ignores voice identity — a short interjection by Alice gets merged into Bob just because Bob was speaking around the same time.

The new path loads WeSpeaker (already used by the speaker embedding pipeline) lazily in DiarizationPipeline, builds a centroid embedding per retained speaker, embeds each excess segment, and assigns it to the max-cosine-similarity retained centroid. Segments shorter than 1 s fall back to the old temporal heuristic because WeSpeaker is unreliable on very short clips.

Only engages when the user set a specific speaker count AND Sortformer's raw output exceeded it. Automatic mode, under-count cases, and short-audio edge cases all take the unchanged fast path.
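That gate amounts to a single guard, roughly (illustrative names, not the actual source):

```swift
// Embedding reassignment engages only for the over-count case;
// automatic mode (no requested count) keeps the fast path.
func shouldReassignWithEmbeddings(requestedSpeakers: Int?, detectedClusters: Int) -> Bool {
    guard let requested = requestedSpeakers else { return false }
    return detectedClusters > requested
}
```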
