echolite

Speaker-tower KV inversion for the Echo TTS model. Per-character conditioning tensors learned by gradient-descent search against a frozen backbone — the audio analog of Textual Inversion.

The one-line version: treat the speaker tower's output activations (H_spk) as the learnable parameter, freeze everything else in Echo, and optimize against the rectified-flow velocity loss on ~10–20 target clips. The result is a ~3.3 MB tensor that drops into Echo's existing speaker-conditioning pathway at inference. No architectural changes.

The full spec is in Speaker-Tower-KV-Inversion.pdf and CLAUDE.md.


What this is (and isn't)

  • ✓ A parameter-efficient few-shot speaker/character adaptation procedure for an Echo-class rectified-flow DiT.
  • ✓ A drop-in inference path: an inverted H_spk substitutes for the live speaker-tower output. The rest of Echo's pipeline is untouched.
  • ✓ A research substrate for style transfer (Phase 10): velocity-space arithmetic to render an arbitrary speaker in a character's prosody.
  • ✗ Not a rebuild of Echo, not a weight-space fine-tuner, not a generic adaptation framework.

Repository layout

echolite/                   # the package
├── backbone.py             # frozen Echo adapter: load_echo(), H_spk_to_kv_cache_speaker()
├── trainer.py              # Algorithm 1 — gradient-descent search for one H_spk
├── inference.py            # generate(): single-H_spk substitution into Echo's Euler sampler
├── contrastive.py          # generate_contrastive(): 4-forward CFG with H_c + H_neut
├── transfer.py             # generate_with_style_transfer(): A-speaker × B-style (Phase 10)
├── artifact.py             # save/load: H_spk.safetensors + sidecar.yaml + train_curve.jsonl
├── diagnostics.py          # check-init, check-overfit, held-out loss, cosine-to-init
├── config.py               # TrainConfig defaults (S=4000, lr=1e-3, AdamW)
├── data.py                 # clip loading, PCA latents, patched masks, timestep sampler
├── per_layer_kv.py         # gated fallback (per-block K/V inversion)
└── cli.py                  # `echolite invert|generate|check-init|check-overfit`

scripts/        # phase-N runners (CPU prep, Slurm-submitted training, batch arrays)
tests/          # pytest scaffolding
workflows/      # workflow.md (phase spec) + plan-of-action.md (execution order)
reports/        # per-phase reports (phase_N__slug.md) — written as work completes
presentations/  # PI-facing decks
characters/     # per-character artifacts and notes (gitignored — see below)
data/           # source clips + dataset manifests (gitignored)

The characters/ and data/ directories hold the actual artifacts, source audio, and inversion tensors. They are excluded from version control — only the code, reports, and per-character notes that go with the code live in the repo.


Installation

Echo's source (the backbone) is not included in this repository and must be made available locally. The adapter echolite/backbone.py expects to import it from a sibling echo-tts/ directory at the repo root.

# 1. Clone this repo and place Echo's source next to it
git clone git@github.com:deepgram/echolite.git
cd echolite
# place echo-tts/ here (private repo / internal release)

# 2. Install with uv (or pip)
uv sync                        # creates .venv, resolves uv.lock
# or:  pip install -e .

# 3. (Optional) sanity-import
python -c "from echolite.backbone import load_echo"

A GPU is required for invert and generate. The backbone, Fish DAC, and PCA state are pulled from HuggingFace on first call (jordand/echo-tts-base, jordand/fish-s1-dac-min).


Workflows

The CLI has four subcommands:

echolite invert        --character ID --clips MANIFEST.jsonl [--steps S --lr LR]
echolite generate      --character-dir characters/<id>/ --text "..." --out OUT.wav [--style v1|v2]
echolite check-init    --character ID --clips MANIFEST.jsonl --out-dir DIR
echolite check-overfit --character ID --clip CLIP_ID --clips MANIFEST.jsonl --out-dir DIR

All GPU work is intended to run via Slurm (--partition=main); see workflows/workflow.md and the sbatch templates in scripts/.

Clip manifest format

JSONL, one record per line:

{"clip_id": "imHL89QjPP0__spk0__0105.45-0120.41", "audio_path": "clips/imHL89QjPP0__spk0__0105.45-0120.41.wav", "transcript": "Gravy. So can you still tap into the aquatic supply stores..."}

audio_path may be absolute or relative to the manifest's parent directory.
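
A minimal sketch of reading a manifest in this format, including the relative-path rule above. `read_manifest` is a hypothetical helper written for illustration; it is not part of echolite's API, and echolite's own loader in data.py may validate differently.

```python
import json
from pathlib import Path


def read_manifest(manifest_path):
    """Parse a JSONL clip manifest and resolve each audio_path.

    Hypothetical helper for illustration; not echolite's loader.
    """
    manifest_path = Path(manifest_path)
    records = []
    with manifest_path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            rec = json.loads(line)
            missing = {"clip_id", "audio_path", "transcript"} - rec.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            audio = Path(rec["audio_path"])
            if not audio.is_absolute():
                # Relative paths are taken relative to the manifest's directory.
                audio = manifest_path.parent / audio
            rec["audio_path"] = audio
            records.append(rec)
    return records
```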

1 — Invert a new character

echolite invert \
  --character ace_ventura__22924144 \
  --clips characters/ace_ventura__22924144/clips_manifest.jsonl \
  --steps 4000 --lr 1e-3

Defaults are the CLAUDE.md recipe: AdamW, lr=1e-3, weight_decay=0, S=4000 steps, stratified logit-normal timestep sampler. The trainer holds out 2 of N clips for validation, monitors per-clip held-out loss + cosine(H_spk, H_spk^(0)), and emits one StepLog per validation step. Output lands at:

characters/<id>/
├── H_spk.safetensors    # bf16 H_spk + uint8 padding mask
├── sidecar.yaml         # init clip ids, audio sha256s, transcripts, hyperparams, backbone version
└── train_curve.jsonl    # per-step loss/cosine/grad-norm

The artifact is self-describing — every field needed to re-derive it from the source clips is in sidecar.yaml. Before scaling to a new character, always run the sanity checks (next section).
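
The "stratified logit-normal timestep sampler" named in the defaults can be illustrated with a short sketch. This is one common way to build such a sampler, not necessarily how echolite/data.py does it: the function name, the `mean`/`std` values, and the clamping constant are all assumptions.

```python
import math
import random
from statistics import NormalDist


def stratified_logit_normal(n, mean=0.0, std=1.0, rng=None):
    """Draw n timesteps in (0, 1), one per equal-probability stratum.

    mean/std are assumed values, not echolite's configured ones.
    """
    rng = rng or random.Random()
    normal = NormalDist(mean, std)
    timesteps = []
    for i in range(n):
        # Uniform draw restricted to the i-th of n equal-width strata.
        u = (i + rng.random()) / n
        # Keep u strictly inside (0, 1) so the quantile function is defined.
        u = min(max(u, 1e-6), 1.0 - 1e-6)
        z = normal.inv_cdf(u)  # Gaussian quantile for this stratum
        timesteps.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid maps to (0, 1)
    return timesteps
```

Stratifying the uniform draw before the quantile transform keeps the logit-normal shape while guaranteeing that every batch covers the whole timestep range, which tends to lower the variance of the per-step loss estimate.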

2 — Generate audio from an artifact

Single-speaker generation, using the standard Echo Euler sampler with H_spk substituted for the live speaker tower:

echolite generate \
  --character-dir characters/ace_ventura__22924144 \
  --text "Alrighty then. Time to find that albino pigeon." \
  --out out.wav \
  --style v2

--style selects a CFG preset that matches Echo's recommended values for spoken (v1: cfg_text=3.0, cfg_speaker=10.0) or sung-leaning (v2: cfg_text=5.0, cfg_speaker=8.0) material. Omit it to use Echo's own internal defaults.

Programmatic use:

from echolite.backbone import load_echo
from echolite.artifact import load
from echolite.inference import generate
import torch, torchaudio

bundle = load_echo(device="cuda", dtype=torch.bfloat16)
art = load("characters/ace_ventura__22924144")
audio, debug = generate(
    bundle, art.H_spk, art.mask,
    "Alrighty then. Time to find that albino pigeon.",
    num_steps=40, cfg_scale_text=3.0, cfg_scale_speaker=10.0,
)
torchaudio.save("out.wav", audio.squeeze().float().cpu().unsqueeze(0), 44_100)

3 — Contrastive (negative) guidance against performer-neutral voice

If you have neutral material for the same performer (interviews, narration), invert a second tensor H_neut and use echolite.contrastive.generate_contrastive to subtract performer identity from the guidance signal. From the tech note:

v_guided = v_θ(·; H_c)
         + w_pos · ( v_θ(·; H_c)    − v_uncond )
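         − w_neg · ( v_θ(·; H_neut) − v_uncond )

from echolite.contrastive import generate_contrastive

# Assumes bundle = load_echo(...), and art_c / art_neut loaded with
# echolite.artifact.load(...) for the character and neutral inversions.
audio, debug = generate_contrastive(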
    bundle,
    H_c=art_c.H_spk, mask_c=art_c.mask,
    H_neut=art_neut.H_spk, mask_neut=art_neut.mask,
    transcript="There you are. The shadow man is in.",
    w_pos=1.5, w_neg=0.5,   # CLAUDE.md defaults; tune per character
)

This is helpful when the character is only weakly represented in the base manifold and scaling w_pos alone doesn't separate the character from the performer.
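
The guidance formula above is plain element-wise arithmetic on three velocity predictions. A toy sketch on Python lists, assuming the three velocities have already been computed by the backbone; `contrastive_velocity` is a hypothetical name and is not the body of generate_contrastive:

```python
def contrastive_velocity(v_c, v_neut, v_uncond, w_pos=1.5, w_neg=0.5):
    """Combine three velocity predictions per the contrastive formula.

    Inputs are flat sequences of equal length (toy stand-ins for tensors).
    """
    return [
        vc + w_pos * (vc - vu) - w_neg * (vn - vu)
        for vc, vn, vu in zip(v_c, v_neut, v_uncond)
    ]


# One-element example: 1.0 + 1.5 * (1.0 - 0.0) - 0.5 * (0.5 - 0.0)
print(contrastive_velocity([1.0], [0.5], [0.0]))  # -> [2.25]
```

Setting w_neg to 0 recovers ordinary positive guidance toward H_c, so w_neg can be introduced gradually when tuning.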

4 — Style transfer: arbitrary speaker × character style (A → B)

echolite.transfer.generate_with_style_transfer takes a source-speaker reference audio, a character inversion H_c, and the matching performer-neutral inversion H_neut, and renders the source speaker delivering arbitrary text in the character's prosody. The construction is velocity-space arithmetic:

v_anchor   = v_θ(·; H_arb)         # H_arb = SpeakerEncoder(source_audio), live
v_c        = v_θ(·; H_c)
v_n        = v_θ(·; H_neut)
v_uncond_t = v_θ(·; H_arb, text_mask=0)

v_guided = v_anchor + w_text · (v_anchor − v_uncond_t)
                    + w_style · (v_c − v_n)

The (v_c − v_n) term is the "character-ness, performer-removed" direction. Added to the source speaker's anchor it transfers the style without dragging in the original performer's vocal identity.

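from echolite.transfer import generate_with_style_transfer

# Assumes bundle = load_echo(...), and art_c / art_neut loaded with
# echolite.artifact.load(...) for the character and neutral inversions.
audio, debug = generate_with_style_transfer(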
    bundle,
    source_audio_path="reference_speaker.wav",
    H_c=art_c.H_spk, mask_c=art_c.mask,
    H_neut=art_neut.H_spk, mask_neut=art_neut.mask,
    transcript="Now what's gonna be? Are you in or out?",
    w_style=1.0, w_text=3.0,
)

This is the more research-heavy of the two end goals: if w_style is too high, prosody can warp; if it is too low, the source speaker dominates. Tune w_style per source speaker on held-out generations. See reports/phase_10__speaker_transfer.md and reports/applying_facilier_to_arbitrary_characters.md for the recipe and listening-test findings.
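
To make the role of w_style concrete, here is a toy sketch of the velocity-space arithmetic from this section on Python lists. `style_transfer_velocity` is a hypothetical name, the inputs stand in for backbone outputs, and this is not the body of generate_with_style_transfer.

```python
def style_transfer_velocity(v_anchor, v_uncond_t, v_c, v_n,
                            w_text=3.0, w_style=1.0):
    """Text-guided anchor velocity plus a (character - neutral) style offset."""
    return [
        va + w_text * (va - vu) + w_style * (vc - vn)
        for va, vu, vc, vn in zip(v_anchor, v_uncond_t, v_c, v_n)
    ]


# If the character and neutral predictions agree, the style term vanishes
# and only text guidance on the source speaker remains.
print(style_transfer_velocity([1.0], [0.0], [2.0], [2.0]))  # -> [4.0]
# A non-zero (v_c - v_n) direction shifts the result by w_style times it.
print(style_transfer_velocity([1.0], [0.0], [2.0], [1.5]))  # -> [4.5]
```

The first call shows why a poorly separated H_c / H_neut pair yields little audible style transfer: the offset scales with how far the two predictions differ, not with w_style alone.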


Sanity checks (always run on a new character)

Two cheap diagnostics catch nearly all setup bugs before you spend GPU hours on a full inversion. From CLAUDE.md "Sanity Checks":

# 1. Initialization-only baseline — does the init alone already sound right?
echolite check-init \
  --character new_char \
  --clips characters/new_char/clips_manifest.jsonl \
  --out-dir characters/new_char/_sanity

# 2. Single-clip overfit — can we reconstruct one clip from itself in 500 steps?
echolite check-overfit \
  --character new_char \
  --clip <one_clip_id_from_manifest> \
  --clips characters/new_char/clips_manifest.jsonl \
  --out-dir characters/new_char/_sanity \
  --steps 500

A failed init-only check usually means the init pipeline is broken (wrong audio path, dtype mismatch, padding bug). A failed single-clip overfit means optimization itself is broken (most often: requires_grad set on the wrong tensor). Both passing → proceed to the full invert run.

During a full run, the two diagnostics to watch are per-clip held-out reconstruction loss (primary early-stopping signal) and cosine(H_spk, H_spk^(0)) (≈1.0 → optimization barely moved; ≈0.0 → drifted off the manifold). Both are emitted to train_curve.jsonl and printed by the CLI.
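
For reference, the cosine-to-init diagnostic is an ordinary cosine similarity over the flattened tensors. A minimal sketch on flat Python sequences; `cosine_to_init` is a hypothetical name, and whether echolite excludes padded positions before computing it is not stated here, so this version uses every element.

```python
import math


def cosine_to_init(h, h0):
    """Cosine similarity between a flattened H_spk and its initialization."""
    dot = sum(a * b for a, b in zip(h, h0))
    norm = math.sqrt(sum(a * a for a in h)) * math.sqrt(sum(b * b for b in h0))
    # Guard against an all-zero tensor rather than dividing by zero.
    return dot / norm if norm > 0 else 0.0


print(cosine_to_init([3.0, 4.0], [3.0, 4.0]))  # -> 1.0 (no movement from init)
print(cosine_to_init([1.0, 0.0], [0.0, 1.0]))  # -> 0.0 (orthogonal to init)
```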


Per-character artifact

The artifact is the unit of distribution: about 3.3 MB in fp32 or 1.6 MB in bf16 at T_spk=640, d_spk=1280. There is one tensor per character; an inverted tensor does not transfer between characters.

characters/<id>/
├── H_spk.safetensors       # tensor + padding mask
├── sidecar.yaml            # init manifest, hyperparams, audio sha256s, backbone version
├── clips_manifest.jsonl    # source clips with transcripts
├── train_curve.jsonl       # per-step loss / cosine / grad-norm
└── notes.md                # qualitative voice notes, deviations from defaults

Load it from anywhere with echolite.artifact.load(path); substitute into Echo with echolite.inference.generate(...). Nothing else in the Echo deployment changes.
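
The sizes quoted in this section follow directly from the tensor shape. A quick check, using decimal megabytes; the mask size assumes one uint8 per T_spk position, which is an assumption since the mask shape is not stated here.

```python
def tensor_megabytes(t_spk, d_spk, bytes_per_element):
    """Raw payload size of a (t_spk, d_spk) tensor in decimal megabytes."""
    return t_spk * d_spk * bytes_per_element / 1e6


fp32_mb = tensor_megabytes(640, 1280, 4)  # 4 bytes per fp32 element
bf16_mb = tensor_megabytes(640, 1280, 2)  # 2 bytes per bf16 element
mask_bytes = 640 * 1                      # assumed: one uint8 per position

print(fp32_mb)     # -> 3.2768
print(bf16_mb)     # -> 1.6384
print(mask_bytes)  # -> 640
```

The safetensors header and the sidecar files add a small amount on top of these payload figures.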


Where things are documented


License

Internal — see repository settings.
