Speaker-tower KV inversion for the Echo TTS model. Per-character conditioning tensors learned by gradient-descent search against a frozen backbone — the audio analog of Textual Inversion.
The one-line version: treat the speaker tower's output activations (H_spk) as the learnable parameter, freeze everything else in Echo, and optimize against the rectified-flow velocity loss on ~10–20 target clips. The result is a ~3.3 MB tensor that drops into Echo's existing speaker-conditioning pathway at inference. No architectural changes.
The full spec is in Speaker-Tower-KV-Inversionn.pdf and CLAUDE.md.
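The core idea (optimize only the conditioning while the backbone stays frozen) can be illustrated with a toy one-dimensional analog. This is a sketch, not the echolite trainer: `frozen_velocity`, `invert_conditioning`, the scalar `h`, plain SGD instead of AdamW, and the interpolation convention `x_t = (1 - t) * x0 + t * x1` with target velocity `x1 - x0` are all assumptions made for illustration.

```python
import random


def frozen_velocity(x_t, t, h):
    """Toy stand-in for the frozen backbone. Its 'weights' (0.5, -0.25)
    never change; the conditioning h is the only quantity we may optimize."""
    return 0.5 * x_t - 0.25 * t + h


def invert_conditioning(targets, steps=5000, lr=0.005, seed=0):
    """Gradient-descent search for the scalar conditioning h (toy analog of H_spk)."""
    rng = random.Random(seed)
    h = 0.0
    for _ in range(steps):
        x1 = rng.choice(targets)       # clean target sample (a "clip")
        x0 = rng.gauss(0.0, 1.0)       # noise sample
        t = rng.random()               # timestep in [0, 1)
        x_t = (1.0 - t) * x0 + t * x1  # straight-line interpolation (assumed convention)
        v_target = x1 - x0             # rectified-flow velocity target
        err = frozen_velocity(x_t, t, h) - v_target
        # Loss is err**2; h enters additively, so d(loss)/dh = 2 * err.
        h -= lr * 2.0 * err
    return h


h = invert_conditioning([2.0])
```

For this toy model the loss-minimizing value is `0.75 * mean(targets) + 0.125`, so with a single target of 2.0 the search should settle near 1.625, up to SGD noise.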
- ✓ A parameter-efficient few-shot speaker/character adaptation procedure for an Echo-class rectified-flow DiT.
- ✓ A drop-in inference path: an inverted H_spk substitutes for the live speaker-tower output. The rest of Echo's pipeline is untouched.
- ✓ A research substrate for style transfer (Phase 10): velocity-space arithmetic to render an arbitrary speaker in a character's prosody.
- ✗ Not a rebuild of Echo, not a weight-space fine-tuner, not a generic adaptation framework.
echolite/ # the package
├── backbone.py # frozen Echo adapter: load_echo(), H_spk_to_kv_cache_speaker()
├── trainer.py # Algorithm 1 — gradient-descent search for one H_spk
├── inference.py # generate(): single-H_spk substitution into Echo's Euler sampler
├── contrastive.py # generate_contrastive(): 4-forward CFG with H_c + H_neut
├── transfer.py # generate_with_style_transfer(): A-speaker × B-style (Phase 10)
├── artifact.py # save/load: H_spk.safetensors + sidecar.yaml + train_curve.jsonl
├── diagnostics.py # check-init, check-overfit, held-out loss, cosine-to-init
├── config.py # TrainConfig defaults (S=4000, lr=1e-3, AdamW)
├── data.py # clip loading, PCA latents, patched masks, timestep sampler
├── per_layer_kv.py # gated fallback (per-block K/V inversion)
└── cli.py # `echolite invert|generate|check-init|check-overfit`
scripts/ # phase-N runners (CPU prep, Slurm-submitted training, batch arrays)
tests/ # pytest scaffolding
workflows/ # workflow.md (phase spec) + plan-of-action.md (execution order)
reports/ # per-phase reports (phase_N__slug.md) — written as work completes
presentations/ # PI-facing decks
characters/ # per-character artifacts and notes (gitignored — see below)
data/ # source clips + dataset manifests (gitignored)
The characters/ and data/ directories hold the actual artifacts, source audio, and inversion tensors. They are excluded from version control — only the code, reports, and per-character notes that go with the code live in the repo.
Echo's source (the backbone) is not included in this repository and must be made available locally. The adapter echolite/backbone.py expects to import it from a sibling echo-tts/ directory at the repo root.
# 1. Clone this repo and place Echo's source next to it
git clone git@github.com:deepgram/echolite.git
cd echolite
# place echo-tts/ here (private repo / internal release)
# 2. Install with uv (or pip)
uv sync # creates .venv, resolves uv.lock
# or: pip install -e .
# 3. (Optional) sanity-import
python -c "from echolite.backbone import load_echo"
GPU is required for invert and generate. The backbone, Fish DAC, and PCA state are pulled from HuggingFace on first call (jordand/echo-tts-base, jordand/fish-s1-dac-min).
The CLI has four subcommands:
echolite invert --character ID --clips MANIFEST.jsonl [--steps S --lr LR]
echolite generate --character-dir characters/<id>/ --text "..." --out OUT.wav [--style v1|v2]
echolite check-init --character ID --clips MANIFEST.jsonl --out-dir DIR
echolite check-overfit --character ID --clip CLIP_ID --clips MANIFEST.jsonl --out-dir DIR
All GPU work is intended to run via Slurm (--partition=main); see workflows/workflow.md and the sbatch templates in scripts/.
JSONL, one record per line:
{"clip_id": "imHL89QjPP0__spk0__0105.45-0120.41", "audio_path": "clips/imHL89QjPP0__spk0__0105.45-0120.41.wav", "transcript": "Gravy. So can you still tap into the aquatic supply stores..."}
audio_path may be absolute or relative to the manifest's parent directory.
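A minimal sketch of reading such a manifest and applying the path rule just described. `load_manifest` is a hypothetical helper written for illustration; it is not claimed to match the loader in echolite/data.py.

```python
import json
from pathlib import Path


def load_manifest(manifest_path):
    """Read a JSONL clips manifest and resolve each audio_path.

    Relative audio paths are resolved against the manifest's parent
    directory; absolute paths are kept unchanged.
    """
    manifest_path = Path(manifest_path)
    required = {"clip_id", "audio_path", "transcript"}
    records = []
    with manifest_path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            rec = json.loads(line)
            missing = required - rec.keys()
            if missing:
                raise ValueError(
                    f"{manifest_path}:{line_no}: missing fields {sorted(missing)}"
                )
            audio = Path(rec["audio_path"])
            if not audio.is_absolute():
                audio = manifest_path.parent / audio
            rec["audio_path"] = audio
            records.append(rec)
    return records
```

Failing early on a missing field reports the offending line number, which is easier to act on than a KeyError raised partway through training.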
echolite invert \
--character ace_ventura__22924144 \
--clips characters/ace_ventura__22924144/clips_manifest.jsonl \
--steps 4000 --lr 1e-3
Defaults are the CLAUDE.md recipe: AdamW, lr=1e-3, weight_decay=0, S=4000 steps, stratified logit-normal timestep sampler. The trainer holds out 2 of N clips for validation, monitors per-clip held-out loss + cosine(H_spk, H_spk^(0)), and emits one StepLog per validation step. Output lands at:
characters/<id>/
├── H_spk.safetensors # bf16 H_spk + uint8 padding mask
├── sidecar.yaml # init clip ids, audio sha256s, transcripts, hyperparams, backbone version
└── train_curve.jsonl # per-step loss/cosine/grad-norm
The artifact is self-describing — every field needed to re-derive it from the source clips is in sidecar.yaml. Before scaling to a new character, always run the sanity checks (next section).
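The training defaults name a stratified logit-normal timestep sampler. One common way to build such a sampler is sketched below: draw one uniform value per equal-probability stratum, push it through the Gaussian quantile function, then through a sigmoid. The function name and the `mean=0.0, std=1.0` parameters are assumptions; the values echolite actually uses are not stated here.

```python
import math
import random
from statistics import NormalDist


def stratified_logit_normal(batch_size, mean=0.0, std=1.0, rng=None):
    """Return batch_size timesteps in (0, 1), one per quantile stratum."""
    rng = rng or random.Random()
    normal = NormalDist(mean, std)
    timesteps = []
    for i in range(batch_size):
        # One uniform draw inside stratum [i/B, (i+1)/B).
        u = (i + rng.random()) / batch_size
        # inv_cdf is only defined on the open interval (0, 1).
        u = min(max(u, 1e-6), 1.0 - 1e-6)
        z = normal.inv_cdf(u)                         # Gaussian quantile
        timesteps.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid -> (0, 1)
    return timesteps


ts = stratified_logit_normal(8, rng=random.Random(0))
```

Stratifying guarantees that every batch covers the whole timestep range, which lowers the variance of the per-step loss compared with independent draws.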
Single-speaker generation, using the standard Echo Euler sampler with H_spk substituted for the live speaker tower:
echolite generate \
--character-dir characters/ace_ventura__22924144 \
--text "Alrighty then. Time to find that albino pigeon." \
--out out.wav \
--style v2
--style selects a CFG preset that matches Echo's recommended values for spoken (v1: cfg_text=3.0, cfg_speaker=10.0) or sung-leaning (v2: cfg_text=5.0, cfg_speaker=8.0) material. Omit it to use Echo's own internal defaults.
Programmatic use:
from echolite.backbone import load_echo
from echolite.artifact import load
from echolite.inference import generate
import torch, torchaudio
bundle = load_echo(device="cuda", dtype=torch.bfloat16)
art = load("characters/ace_ventura__22924144")
audio, debug = generate(
bundle, art.H_spk, art.mask,
"Alrighty then. Time to find that albino pigeon.",
num_steps=40, cfg_scale_text=3.0, cfg_scale_speaker=10.0,
)
torchaudio.save("out.wav", audio.squeeze().float().cpu().unsqueeze(0), 44_100)
If you have neutral material for the same performer (interviews, narration), invert a second tensor H_neut and use echolite.contrastive.generate_contrastive to subtract performer identity from the guidance signal. From the tech note:
v_guided = v_θ(·; H_c)
+ w_pos · ( v_θ(·; H_c) − v_uncond )
− w_neg · ( v_θ(·; H_neut) − v_uncond )
from echolite.contrastive import generate_contrastive
audio, debug = generate_contrastive(
bundle,
H_c=art_c.H_spk, mask_c=art_c.mask,
H_neut=art_neut.H_spk, mask_neut=art_neut.mask,
transcript="There you are. The shadow man is in.",
w_pos=1.5, w_neg=0.5, # CLAUDE.md defaults; tune per character
)
Helpful when the character is only weakly represented in the base manifold and scaling w_pos alone doesn't separate the character from the performer.
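The contrastive guidance rule quoted from the tech note is a linear combination of three velocity predictions, and can be sketched per sampler step as follows. `contrastive_velocity` is a hypothetical helper, not the body of `generate_contrastive`; it is written with plain arithmetic so the same expression applies elementwise to floats or tensors.

```python
def contrastive_velocity(v_c, v_neut, v_uncond, w_pos=1.5, w_neg=0.5):
    """Combine velocities as
    v_c + w_pos * (v_c - v_uncond) - w_neg * (v_neut - v_uncond)."""
    push_toward_character = w_pos * (v_c - v_uncond)
    pull_from_performer = w_neg * (v_neut - v_uncond)
    return v_c + push_toward_character - pull_from_performer


# Scalar stand-ins for the three forward passes at one sampler step.
guided = contrastive_velocity(v_c=1.0, v_neut=0.5, v_uncond=0.0)

# With w_neg = 0 the rule reduces to ordinary classifier-free guidance on H_c.
plain_cfg = contrastive_velocity(v_c=1.0, v_neut=0.5, v_uncond=0.0, w_neg=0.0)
```

When the neutral and character predictions coincide, the two terms partially cancel and the effective guidance weight becomes `w_pos - w_neg`, which is why the negative term only changes the output where the character and the performer actually differ.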
echolite.transfer.generate_with_style_transfer takes a source-speaker reference audio, a character inversion H_c, and the matching performer-neutral inversion H_neut, and renders the source speaker delivering arbitrary text in the character's prosody. The construction is velocity-space arithmetic:
v_anchor = v_θ(·; H_arb) # H_arb = SpeakerEncoder(source_audio), live
v_c = v_θ(·; H_c)
v_n = v_θ(·; H_neut)
v_uncond_t = v_θ(·; H_arb, text_mask=0)
v_guided = v_anchor + w_text · (v_anchor − v_uncond_t)
+ w_style · (v_c − v_n)
The (v_c − v_n) term is the "character-ness, performer-removed" direction. Added to the source speaker's anchor it transfers the style without dragging in the original performer's vocal identity.
from echolite.transfer import generate_with_style_transfer
audio, debug = generate_with_style_transfer(
bundle,
source_audio_path="reference_speaker.wav",
H_c=art_c.H_spk, mask_c=art_c.mask,
H_neut=art_neut.H_spk, mask_neut=art_neut.mask,
transcript="Now what's gonna be? Are you in or out?",
w_style=1.0, w_text=3.0,
)
This is the research-heavier of the two end goals: when w_style is too high, prosody can warp; too low and the source speaker dominates. Tune per source on held-out generations. See reports/phase_10__speaker_transfer.md and reports/applying_facilier_to_arbitrary_characters.md for the recipe + listening-test findings.
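To make the role of w_style concrete, the velocity arithmetic above can be written as a small function and evaluated over a sweep of weights. Both `style_transfer_velocity` and the sweep values are illustrative assumptions, not the implementation in echolite/transfer.py.

```python
def style_transfer_velocity(v_anchor, v_uncond_t, v_c, v_n, w_text=3.0, w_style=1.0):
    """v_anchor + w_text * (v_anchor - v_uncond_t) + w_style * (v_c - v_n)."""
    text_term = w_text * (v_anchor - v_uncond_t)  # text guidance on the source speaker
    style_term = w_style * (v_c - v_n)            # character direction, performer removed
    return v_anchor + text_term + style_term


# Scalar stand-ins for the four forward passes at one sampler step.
passes = dict(v_anchor=1.0, v_uncond_t=0.5, v_c=2.0, v_n=1.5)

# Sweep w_style to see how strongly the style direction moves the result.
sweep = {w: style_transfer_velocity(**passes, w_style=w) for w in (0.0, 0.5, 1.0, 2.0)}
```

At `w_style=0.0` the expression is ordinary text guidance on the source speaker, so the sweep starts from a pure source-speaker baseline and adds the style direction linearly.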
Two cheap diagnostics catch nearly all setup bugs before you spend GPU hours on a full inversion. From CLAUDE.md "Sanity Checks":
# 1. Initialization-only baseline — does the init alone already sound right?
echolite check-init \
--character new_char \
--clips characters/new_char/clips_manifest.jsonl \
--out-dir characters/new_char/_sanity
# 2. Single-clip overfit — can we reconstruct one clip from itself in 500 steps?
echolite check-overfit \
--character new_char \
--clip <one_clip_id_from_manifest> \
--clips characters/new_char/clips_manifest.jsonl \
--out-dir characters/new_char/_sanity \
--steps 500
A failed init-only check usually means the init pipeline is broken (wrong audio path, dtype mismatch, padding bug). A failed single-clip overfit means optimization itself is broken (most often: requires_grad set on the wrong tensor). Both passing → proceed to the full invert run.
During a full run, the two diagnostics to watch are per-clip held-out reconstruction loss (primary early-stopping signal) and cosine(H_spk, H_spk^(0)) (≈1.0 → optimization barely moved; ≈0.0 → drifted off the manifold). Both are emitted to train_curve.jsonl and printed by the CLI.
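A sketch of how these two signals might be computed and read offline. The record keys `step` and `heldout_loss` are hypothetical (the actual train_curve.jsonl schema is not shown here), and the cosine is taken over flattened values without applying the padding mask.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened tensors given as sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def best_validation_step(curve):
    """Return the step with the lowest held-out loss (early-stopping point)."""
    return min(curve, key=lambda rec: rec["heldout_loss"])["step"]


# Hypothetical validation records, in the order they were logged.
curve = [
    {"step": 100, "heldout_loss": 0.42},
    {"step": 200, "heldout_loss": 0.31},
    {"step": 300, "heldout_loss": 0.35},  # held-out loss rising: overfitting
]
stop_at = best_validation_step(curve)
```

Reading the two signals together is the useful part: a falling held-out loss with cosine still near 1.0 suggests a small, safe move away from the init, while a rising held-out loss with falling cosine suggests the tensor is drifting.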
The artifact is the unit of distribution: ~3.3 MB in fp32, or ~1.6 MB in bf16, at T_spk=640, d_spk=1280. One tensor per character; inverted tensors do not transfer between characters.
characters/<id>/
├── H_spk.safetensors # tensor + padding mask
├── sidecar.yaml # init manifest, hyperparams, audio sha256s, backbone version
├── clips_manifest.jsonl # source clips with transcripts
├── train_curve.jsonl # per-step loss / cosine / grad-norm
└── notes.md # qualitative voice notes, deviations from defaults
Load it from anywhere with echolite.artifact.load(path); substitute into Echo with echolite.inference.generate(...). Nothing else in the Echo deployment changes.
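The artifact sizes quoted above follow directly from the tensor shape. A quick check, assuming a dense T_spk × d_spk tensor plus one uint8 mask entry per position, and ignoring safetensors header overhead:

```python
T_SPK, D_SPK = 640, 1280  # shape quoted for H_spk


def tensor_megabytes(bytes_per_element):
    """Size of a dense T_SPK x D_SPK tensor in MB (1 MB = 10**6 bytes)."""
    return T_SPK * D_SPK * bytes_per_element / 1e6


fp32_mb = tensor_megabytes(4)  # 4 bytes per float32 element
bf16_mb = tensor_megabytes(2)  # 2 bytes per bfloat16 element
mask_bytes = T_SPK * 1         # uint8 padding mask, one byte per position

print(f"fp32: {fp32_mb:.2f} MB, bf16: {bf16_mb:.2f} MB, mask: {mask_bytes} bytes")
```

The mask is negligible next to the tensor, so the dtype choice is what determines the file size.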
- CLAUDE.md: full method, defaults, how to behave in the project (the operative spec for autonomous and assisted work).
- Speaker-Tower-KV-Inversionn.pdf: original tech note.
- workflows/workflow.md: the 11 phases (0–10), what each delivers, gating decisions.
- workflows/plan-of-action.md: execution order + dependency graph for the phases.
- reports/phase_*.md: what was actually done, decisions, numbers, artifacts produced.
- presentations/: PI-facing summaries.
Internal — see repository settings.