Dogfood benchmark harness for speaker ID #2

Draft

ComputelessComputer wants to merge 8 commits into main from speaker-id/benchmark-harness

Conversation


ComputelessComputer (Collaborator) commented Apr 16, 2026

Draft harness for measuring speaker identification quality. Part of #1.

What changed (pivoted from synthetic to dogfood)

Dropped the VoxCeleb/LibriSpeech stitcher. Replaced with a labeler that runs on your own Char meetings.

Rationale: synthetic meetings optimize the wrong thing. Read speech doesn't sound like a meeting. Celebrity interviews aren't in Korean. AMI is closer but still not Char conditions. Your last 20 meetings are the actual target distribution — label them in an evening, measure against that.

Pipeline

your meeting-*.md exports
     ↓  label.py (play 5s per speaker, type a name)
ground/<meeting_id>.json
     ↓  extract_embeddings.py (TODO — Swift bridge)
meetings/<meeting_id>/embeddings.json
     ↓  score.py (Python mirror of store.ts)
results/<name>.json
     ↓  report.py
accuracy · unknown_rejection · false_accept · calibration_error
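
To make the handoff between stages concrete, here is a minimal sketch of a ground/<meeting_id>.json label file and a loader for it. The field names and values are illustrative assumptions; the actual schema label.py writes may differ.

```python
import json

# Hypothetical label file contents; the real schema emitted by label.py may differ.
example = {
    "meeting_id": "meeting-2026-04-14",
    "speakers": {
        "Speaker 1": "Jinsu",    # hypothetical name, typed during labeling
        "Speaker 2": "unknown",  # a stranger the labeler couldn't identify
    },
}

def load_ground(path):
    """Read a label file and return the diarized-speaker -> name map."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["speakers"]

print(json.dumps(example, indent=2))
```

Keeping the per-speaker map flat like this is what makes one-pass-per-meeting labeling cheap: there is one entry per diarized speaker, not one per turn.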

Why four metrics, not one

store.ts has a 0.04 margin gate specifically to reject unknown speakers. A benchmark that only tracks identification accuracy pushes the optimizer to drop that gate and confidently mislabel every new person. score.py splits each meeting's humans in half — enrolled vs strangers — so regressions on rejection get caught.
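
The gate itself is a small check. A minimal Python sketch, assuming cosine similarity and the 0.04 margin mentioned above; the actual scoring in store.ts and score.py may differ:

```python
import math

MARGIN = 0.04  # margin gate, mirroring the threshold described for store.ts

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def identify(embedding, profiles):
    """Return the best-matching enrolled name, or None (unknown) when the
    winner doesn't clear the runner-up by MARGIN."""
    scored = sorted(
        ((cosine(embedding, emb), name) for name, emb in profiles.items()),
        reverse=True,
    )
    if len(scored) > 1 and scored[0][0] - scored[1][0] < MARGIN:
        return None  # ambiguous match: reject rather than confidently mislabel
    return scored[0][1]

profiles = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
print(identify([0.9, 0.1], profiles))  # clear winner -> alice
print(identify([0.7, 0.7], profiles))  # near-tie -> None (rejected)
```

Dropping the `None` branch would raise raw accuracy on enrolled speakers while silently breaking unknown rejection, which is exactly why the stranger cohort has to stay in the metric set.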

What's still missing

  • extract_embeddings.py: the Swift speaker-embedding extractor needs a CLI entry point. Until that lands, embeddings.json doesn't get generated. Tracked in #3 (Build the speaker ID benchmark dataset, dogfood 20 meetings).
  • baseline.json: populated once the pipeline runs end-to-end.
  • Raw meeting export: if speakers are already labeled in the app, the markdown has human names rather than Speaker 1/Speaker 2, and matcher guesses contaminate ground truth. Either label fresh meetings or add uchar meetings export --raw.

Status

Draft. Safe to land the labeler and scorer now; embedding extraction and baseline happen next.

Part of #1 and #3.

Commits

  • README, pinned deps, and gitignore for benchmarks/speaker-id. Harness measures speaker identification against labeled synthetic meetings.
  • stitch.py builds synthetic meetings from a speaker-keyed corpus (VCTK, LibriSpeech, VoxCeleb). Emits audio.wav and ground.json per meeting.
  • score.py re-implements scoreSpeakerProfile and recommendSpeakerProfile in Python so the harness can iterate on thresholds without rebuilding the Tauri app. Splits speakers into enrolled and stranger cohorts to measure unknown rejection.
  • report.py prints accuracy, unknown rejection, false accept, and calibration error. Diffs against baseline.json when provided.
  • Stitching LibriSpeech into fake meetings optimizes the wrong thing. Replacing with a labeler that runs on actual Char meetings: the conditions we ship in.
  • label.py plays a random long turn per speaker and asks who they are. One pass per meeting, not per turn: a 3-speaker meeting takes a minute. macOS-first: ffmpeg to slice, afplay to preview.
  • Consume the per-speaker labels emitted by label.py instead of synthetic per-turn fixtures. Split enrolled vs stranger identities per meeting so unknown-speaker rejection stays in the metric set.
  • Drops the tiered-corpus plan. Measures speaker ID on the user's own labeled meetings. Gitignore adds ground/ and meetings/ to keep voice samples off GitHub.
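
As a sketch of the preview step, the ffmpeg slice and afplay playback commands can be assembled like this. Paths, flags, and the 5-second clip length are assumptions about how label.py-style tooling might work, shown in dry-run form so nothing is executed:

```python
import shlex

def preview_commands(wav_path, start_s, clip_s=5):
    """Build (not run) the slice + playback commands for one speaker preview.
    Hypothetical helper; the real label.py may construct these differently."""
    clip = "/tmp/speaker-preview.wav"
    slice_cmd = [
        "ffmpeg", "-y",
        "-ss", str(start_s),   # seek to the chosen long turn
        "-t", str(clip_s),     # take a short clip
        "-i", wav_path,
        clip,
    ]
    play_cmd = ["afplay", clip]  # macOS built-in audio player
    return slice_cmd, play_cmd

slice_cmd, play_cmd = preview_commands("meetings/m1/audio.wav", 42.0)
print(shlex.join(slice_cmd))
print(shlex.join(play_cmd))
```

Building command lists rather than shell strings sidesteps quoting bugs when meeting filenames contain spaces or non-ASCII characters.
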
ComputelessComputer changed the title from "Add speaker ID benchmark harness" to "Dogfood benchmark harness for speaker ID" on Apr 16, 2026