Dogfood benchmark harness for speaker ID #2
Draft
ComputelessComputer wants to merge 8 commits into main from
README, pinned deps, and gitignore for benchmarks/speaker-id. Harness measures speaker identification against labeled synthetic meetings.
stitch.py builds synthetic meetings from a speaker-keyed corpus (VCTK, LibriSpeech, VoxCeleb). Emits audio.wav and ground.json per meeting.
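A minimal sketch of the stitching step, assuming mono WAV clips that share a sample rate; the function name and the `ground.json` turn shape here are illustrative, not the actual stitch.py interface:

```python
import json
import wave
from pathlib import Path

def stitch_meeting(clips, out_dir):
    """Concatenate (speaker_id, wav_path) clips into one synthetic meeting.

    Writes audio.wav plus ground.json with per-turn speaker labels.
    All clips are assumed to share the same sample rate and channel layout.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    turns, cursor, out = [], 0.0, None
    for speaker, path in clips:
        with wave.open(str(path), "rb") as src:
            params = src.getparams()
            frames = src.readframes(src.getnframes())
        if out is None:
            out = wave.open(str(out_dir / "audio.wav"), "wb")
            out.setparams(params)
        # bytes -> seconds: nframes = len / (sampwidth * nchannels)
        dur = len(frames) / (params.framerate * params.sampwidth * params.nchannels)
        out.writeframes(frames)
        turns.append({"speaker": speaker, "start": cursor, "end": cursor + dur})
        cursor += dur
    if out is not None:
        out.close()
    (out_dir / "ground.json").write_text(json.dumps({"turns": turns}, indent=2))
    return turns
```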
score.py re-implements scoreSpeakerProfile and recommendSpeakerProfile in Python so the harness can iterate on thresholds without rebuilding the Tauri app. Splits speakers into enrolled and stranger cohorts to measure unknown rejection.
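The kind of threshold logic the Python port makes cheap to sweep can be sketched as follows. The function name, the 0.5 score floor, and the 0.04 margin default are illustrative knobs, not the shipped values of scoreSpeakerProfile or recommendSpeakerProfile:

```python
def recommend_speaker(scores, threshold=0.5, margin=0.04):
    """Pick the enrolled speaker for a turn, or None to reject as unknown.

    scores: {speaker_id: similarity to that speaker's profile}.
    Two gates: the top score must clear `threshold`, and it must beat
    the runner-up by `margin`, otherwise the turn is treated as a
    stranger. Both values are example defaults to iterate on.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_id, top_score = ranked[0]
    if top_score < threshold:
        return None  # nobody matches well enough
    if len(ranked) > 1 and top_score - ranked[1][1] < margin:
        return None  # ambiguous between two profiles: reject
    return top_id
```

Keeping both gates as parameters is the point of the Python re-implementation: a sweep over `(threshold, margin)` pairs runs in seconds instead of requiring a Tauri rebuild per candidate.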
report.py prints accuracy, unknown rejection, false accept, and calibration error. Diffs against baseline.json when provided.
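A sketch of how those four numbers could fall out of one pass over scored turns; the row schema and the use of mean confidence gap as the calibration measure are assumptions for illustration:

```python
def summarize(rows):
    """Summarize scored turns into the four benchmark metrics.

    Each row: {"truth": str|None, "pred": str|None, "conf": float}.
    truth None means the speaker was in the stranger cohort; pred None
    means the matcher rejected the turn as an unknown voice.
    """
    enrolled = [r for r in rows if r["truth"] is not None]
    strangers = [r for r in rows if r["truth"] is None]
    accuracy = sum(r["pred"] == r["truth"] for r in enrolled) / max(len(enrolled), 1)
    unknown_rejection = sum(r["pred"] is None for r in strangers) / max(len(strangers), 1)
    false_accept = 1.0 - unknown_rejection
    # Calibration: mean gap between stated confidence and actual correctness.
    calibration_error = sum(
        abs(r["conf"] - (r["pred"] == r["truth"])) for r in enrolled
    ) / max(len(enrolled), 1)
    return {
        "accuracy": accuracy,
        "unknown_rejection": unknown_rejection,
        "false_accept": false_accept,
        "calibration_error": calibration_error,
    }
```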
Stitching LibriSpeech into fake meetings optimizes the wrong thing. Replaced the stitcher with a labeler that runs on actual Char meetings, the conditions we ship in.
label.py plays a random long turn per speaker and asks who they are. One pass per meeting, not per turn — a 3-speaker meeting takes a minute. macOS-first: ffmpeg to slice, afplay to preview.
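The turn-picking and preview steps might look like the sketch below. The helper names are assumptions; only `ffmpeg` and `afplay` come from the description above, and the 3-second minimum turn length is an illustrative default:

```python
import subprocess

def pick_preview_turn(turns, min_len=3.0):
    """Pick one long turn per diarized speaker to play for the labeler.

    turns: list of {"speaker": str, "start": float, "end": float}.
    Keeps the longest turn of at least min_len seconds per speaker, so
    the labeler hears one substantial clip rather than every fragment.
    """
    best = {}
    for t in turns:
        dur = t["end"] - t["start"]
        cur = best.get(t["speaker"])
        if dur >= min_len and (cur is None or dur > cur["end"] - cur["start"]):
            best[t["speaker"]] = t
    return best

def preview(audio_path, turn, clip="/tmp/preview.wav"):
    """Slice the turn with ffmpeg, then play it with afplay (macOS)."""
    dur = turn["end"] - turn["start"]
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(audio_path),
         "-ss", str(turn["start"]), "-t", str(dur), clip],
        check=True, capture_output=True)
    subprocess.run(["afplay", clip], check=True)
```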
Consume the per-speaker labels emitted by label.py instead of synthetic per-turn fixtures. Split enrolled vs stranger identities per meeting so unknown-speaker rejection stays in the metric set.
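One way to sketch the per-meeting cohort split; the helper name and the seeded shuffle are assumptions, but the intent matches the harness: strangers are withheld from enrollment so unknown-speaker rejection stays measurable.

```python
import random

def split_cohorts(speaker_ids, seed=0):
    """Split one meeting's labeled humans into (enrolled, strangers).

    Enrolled speakers get profiles; strangers are hidden from enrollment
    so the benchmark can check that the matcher rejects voices it has
    never seen. The shuffle is seeded so reruns score identical cohorts.
    """
    order = sorted(speaker_ids)
    random.Random(seed).shuffle(order)
    half = (len(order) + 1) // 2  # always enroll at least one speaker
    return order[:half], order[half:]
```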
Drops the tiered-corpus plan. Measures speaker ID on the user's own labeled meetings. Gitignore adds ground/ and meetings/ to keep voice samples off GitHub.
Draft harness for measuring speaker identification quality. Part of #1.
What changed (pivoted from synthetic to dogfood)
Dropped the VoxCeleb/LibriSpeech stitcher. Replaced with a labeler that runs on your own Char meetings.
Rationale: synthetic meetings optimize the wrong thing. Read speech doesn't sound like a meeting. Celebrity interviews aren't in Korean. AMI is closer but still not Char conditions. Your last 20 meetings are the actual target distribution — label them in an evening, measure against that.
Pipeline
Why four metrics, not one
`store.ts` has a 0.04 margin gate specifically to reject unknown speakers. A benchmark that only tracks identification accuracy pushes the optimizer to drop that gate and confidently mislabel every new person. `score.py` splits each meeting's humans in half, enrolled vs strangers, so regressions on rejection get caught.
What's still missing
`extract_embeddings.py`: the Swift speaker-embedding extractor needs a CLI entry point. Until that lands, `embeddings.json` doesn't get generated. Tracked in "Build the speaker ID benchmark dataset (dogfood 20 meetings)" #3.
`baseline.json`: populated once the pipeline runs end-to-end.
Existing labels: Speaker 1/Speaker 2 placeholders and matcher guesses contaminate ground truth. Either label fresh meetings or add `uchar meetings export --raw`.
Status
Draft. Safe to land the labeler and scorer now; embedding extraction and baseline happen next.
Part of #1 and #3.