Dogfood benchmark harness for speaker ID #2

Draft

ComputelessComputer wants to merge 8 commits into main from speaker-id/benchmark-harness

Conversation


ComputelessComputer (Collaborator) commented Apr 16, 2026

Draft harness for measuring speaker identification quality. Part of #1.

What changed (pivoted from synthetic to dogfood)

Dropped the VoxCeleb/LibriSpeech stitcher. Replaced with a labeler that runs on your own Char meetings.

Rationale: synthetic meetings optimize the wrong thing. Read speech doesn't sound like a meeting. Celebrity interviews aren't in Korean. AMI is closer but still not Char conditions. Your last 20 meetings are the actual target distribution — label them in an evening, measure against that.

Pipeline

your meeting-*.md exports
     ↓  label.py (play 5s per speaker, type a name)
ground/<meeting_id>.json
     ↓  extract_embeddings.py (TODO — Swift bridge)
meetings/<meeting_id>/embeddings.json
     ↓  score.py (Python mirror of store.ts)
results/<name>.json
     ↓  report.py
accuracy · unknown_rejection · false_accept · calibration_error
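
To make the handoff between stages concrete, here is a minimal sketch of a ground/<meeting_id>.json label file and a loader for it. The field names and values are illustrative assumptions; the actual schema label.py writes may differ.

```python
import json

# Hypothetical label file contents; the real schema emitted by label.py may differ.
example = {
    "meeting_id": "meeting-2026-04-14",
    "speakers": {
        "Speaker 1": "Jinsu",    # hypothetical name, typed during labeling
        "Speaker 2": "unknown",  # a stranger the labeler couldn't identify
    },
}

def load_ground(path):
    """Read a label file and return the diarized-speaker -> name map."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["speakers"]

print(json.dumps(example, indent=2))
```

Keeping the per-speaker map flat like this is what makes one-pass-per-meeting labeling cheap: there is one entry per diarized speaker, not one per turn.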

Why four metrics, not one

store.ts has a 0.04 margin gate specifically to reject unknown speakers. A benchmark that only tracks identification accuracy pushes the optimizer to drop that gate and confidently mislabel every new person. score.py splits each meeting's humans in half — enrolled vs strangers — so regressions on rejection get caught.
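
The gate itself is a small check. A minimal Python sketch, assuming cosine similarity and the 0.04 margin mentioned above; the actual scoring in store.ts and score.py may differ:

```python
import math

MARGIN = 0.04  # margin gate, mirroring the threshold described for store.ts

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def identify(embedding, profiles):
    """Return the best-matching enrolled name, or None (unknown) when the
    winner doesn't clear the runner-up by MARGIN."""
    scored = sorted(
        ((cosine(embedding, emb), name) for name, emb in profiles.items()),
        reverse=True,
    )
    if len(scored) > 1 and scored[0][0] - scored[1][0] < MARGIN:
        return None  # ambiguous match: reject rather than confidently mislabel
    return scored[0][1]

profiles = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
print(identify([0.9, 0.1], profiles))  # clear winner -> alice
print(identify([0.7, 0.7], profiles))  # near-tie -> None (rejected)
```

Dropping the `None` branch would raise raw accuracy on enrolled speakers while silently breaking unknown rejection, which is exactly why the stranger cohort has to stay in the metric set.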

What's still missing

  • extract_embeddings.py: the Swift speaker-embedding extractor needs a CLI entry point. Until that lands, embeddings.json doesn't get generated. Tracked in #3 (Build the speaker ID benchmark dataset, dogfood 20 meetings).
  • baseline.json: populated once the pipeline runs end-to-end.
  • Raw meeting export: if speakers are already labeled in the app, the markdown has human names rather than Speaker 1/Speaker 2, and matcher guesses contaminate ground truth. Either label fresh meetings or add uchar meetings export --raw.

Status

Draft. Safe to land the labeler and scorer now; embedding extraction and baseline happen next.

Part of #1 and #3.

Commits

  • README, pinned deps, and gitignore for benchmarks/speaker-id. Harness measures speaker identification against labeled synthetic meetings.
  • stitch.py builds synthetic meetings from a speaker-keyed corpus (VCTK, LibriSpeech, VoxCeleb). Emits audio.wav and ground.json per meeting.
  • score.py re-implements scoreSpeakerProfile and recommendSpeakerProfile in Python so the harness can iterate on thresholds without rebuilding the Tauri app. Splits speakers into enrolled and stranger cohorts to measure unknown rejection.
  • report.py prints accuracy, unknown rejection, false accept, and calibration error. Diffs against baseline.json when provided.
  • Stitching LibriSpeech into fake meetings optimizes the wrong thing. Replacing with a labeler that runs on actual Char meetings: the conditions we ship in.
  • label.py plays a random long turn per speaker and asks who they are. One pass per meeting, not per turn: a 3-speaker meeting takes a minute. macOS-first: ffmpeg to slice, afplay to preview.
  • Consume the per-speaker labels emitted by label.py instead of synthetic per-turn fixtures. Split enrolled vs stranger identities per meeting so unknown-speaker rejection stays in the metric set.
  • Drops the tiered-corpus plan. Measures speaker ID on the user's own labeled meetings. Gitignore adds ground/ and meetings/ to keep voice samples off GitHub.
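
As a sketch of the preview step, the ffmpeg slice and afplay playback commands can be assembled like this. Paths, flags, and the 5-second clip length are assumptions about how label.py-style tooling might work, shown in dry-run form so nothing is executed:

```python
import shlex

def preview_commands(wav_path, start_s, clip_s=5):
    """Build (not run) the slice + playback commands for one speaker preview.
    Hypothetical helper; the real label.py may construct these differently."""
    clip = "/tmp/speaker-preview.wav"
    slice_cmd = [
        "ffmpeg", "-y",
        "-ss", str(start_s),   # seek to the chosen long turn
        "-t", str(clip_s),     # take a short clip
        "-i", wav_path,
        clip,
    ]
    play_cmd = ["afplay", clip]  # macOS built-in audio player
    return slice_cmd, play_cmd

slice_cmd, play_cmd = preview_commands("meetings/m1/audio.wav", 42.0)
print(shlex.join(slice_cmd))
print(shlex.join(play_cmd))
```

Building command lists rather than shell strings sidesteps quoting bugs when meeting filenames contain spaces or non-ASCII characters.
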
ComputelessComputer changed the title from "Add speaker ID benchmark harness" to "Dogfood benchmark harness for speaker ID" on Apr 16, 2026