[feat] Add JoyAI-Echo multi-shot audio-video generation pipeline#13910

Open
sjq66 wants to merge 1 commit into
huggingface:mainfrom
sjq66:feature/joyai-echo

Conversation

@sjq66 sjq66 commented Jun 10, 2026

What does this PR do?

We are the JoyAI Team (JD.com), and this is the Diffusers implementation for the JoyAI-Echo model.

Fixes #13909

Model Overview

JoyAI-Echo is a unified framework for long-form audio-visual generation that supports minute-level multi-shot video creation with synchronized audio, strong temporal consistency, and real-time interaction.

Key Features

  • 🎞️ Minute-level multi-shot stories: generate a sequence of coherent shots from a list of prompts
  • DMD-distilled few-step inference: ~7.5× faster than the original pipeline
  • 🔊 Joint audio-video generation: one pipeline produces synchronized video and audio
  • 🧠 Paired cross-modal memory bank: conditions each new shot on prior visual identity and voice context for story-level consistency

Implementation Details

New files added:

  • src/diffusers/models/transformers/transformer_joyai_echo.py: JoyAIEchoTransformer3DModel, extends LTX2VideoTransformer3DModel with memory mask support for multi-shot generation
  • src/diffusers/pipelines/joyai_echo/pipeline_joyai_echo.py: JoyAIEchoPipeline, multi-shot pipeline with paired audio-video memory bank
  • src/diffusers/pipelines/joyai_echo/pipeline_joyai_echo_original_checkpoint.py: JoyAIEchoOriginalCheckpointPipeline for loading original checkpoints
  • src/diffusers/pipelines/joyai_echo/pipeline_output.py: Output dataclasses
  • docs/source/en/api/pipelines/joyai_echo.md: API documentation
  • tests/pipelines/joyai_echo/test_joyai_echo.py: Pipeline tests

Usage example:

import torch
from diffusers import JoyAIEchoPipeline
from diffusers.utils import encode_video

pipe = JoyAIEchoPipeline.from_pretrained("jdopensource/JoyAI-Echo", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

output = pipe(
    [
        "A cinematic opening shot of the protagonist entering a quiet train station.",
        "The same protagonist speaks softly while the camera follows through the platform.",
    ],
    height=736,
    width=1280,
    num_frames=241,
    frame_rate=25.0,
)

for i, (frames, audio) in enumerate(zip(output.frames, output.audio)):
    encode_video(frames[0], fps=25, audio=audio[0].float().cpu(), output_path=f"shot_{i:03d}.mp4")

Tests

$ python -m pytest tests/pipelines/joyai_echo/ -v
tests/pipelines/joyai_echo/test_joyai_echo.py::JoyAIEchoPipelineFastTests::test_multishot_memory_prefix PASSED [100%]
======================== 1 passed in 6.54s =========================

Add JoyAI-Echo, a unified framework for long-form audio-visual generation
that supports minute-level multi-shot video creation with synchronized audio.

New files:
- JoyAIEchoTransformer3DModel: audiovisual transformer with memory mask support
- JoyAIEchoPipeline: multi-shot pipeline with paired audio-video memory bank
- JoyAIEchoOriginalCheckpointPipeline: loading from original checkpoints
- Pipeline tests and API documentation

Key features:
- Minute-level multi-shot stories from prompt sequences
- DMD-distilled few-step inference (~7.5x faster)
- Joint audio-video generation in one pipeline
- Paired cross-modal memory bank for story-level consistency

Reference: https://github.com/jd-opensource/JoyAI-Echo
Weights: https://huggingface.co/jdopensource/JoyAI-Echo
@github-actions github-actions Bot added the fixes-issue, documentation (Improvements or additions to documentation), models, tests, pipelines, and size/L (PR with diff > 200 LOC) labels, and removed the fixes-issue label, on Jun 10, 2026
@yiyixuxu

Collaborator

@askserge can you do an initial review here?

@yiyixuxu

Collaborator

@claude can you do a review here?

@github-actions

github-actions Bot commented Jun 10, 2026

Contributor

Claude finished @yiyixuxu's task in 3m 41s


Review of PR #13910: [feat] Add JoyAI-Echo multi-shot audio-video generation pipeline

  • Read repository review guidelines (AGENTS.md, models.md, pipelines.md, modular.md)
  • Read transformer model code
  • Read pipeline code
  • Read original checkpoint pipeline code
  • Read pipeline output and init files
  • Read test code
  • Trace call paths and check for dead code
  • Post comprehensive review

Blocking Issues

1. JoyAIEchoTransformer3DModel is a near-complete copy of its parent — and never exercised

transformer_joyai_echo.py subclasses LTX2VideoTransformer3DModel and overrides the entire forward() method (~200 lines) to add three mask parameters (audio_self_attention_mask, a2v_cross_attention_mask, v2a_cross_attention_mask). Two problems:

  • The pipeline never passes any of these masks. In _denoise_shot (lines 380–405), the transformer is called without audio_self_attention_mask, a2v_cross_attention_mask, or v2a_cross_attention_mask, so all three default to None and the subclass's added parameters are dead code.

  • Fragile duplication. The entire parent forward method is copy-pasted. Any future change to LTX2VideoTransformer3DModel.forward() will silently diverge. The better approach is to add these three parameters to the parent's forward() signature (the transformer blocks already accept them — the parent just hardcodes None), removing the need for a subclass entirely.

Per AGENTS.md: "No defensive code, unused code paths, or legacy stubs."

2. JoyAIEchoOriginalCheckpointPipeline should not be merged

This pipeline (pipeline_joyai_echo_original_checkpoint.py):

  • Manipulates sys.path to import from an external repo checkout (ltx-core/src, ltx-pipelines/src, ltx-distillation/src) — this breaks when the external repo isn't present
  • Imports non-diffusers packages (ltx_distillation.*, torchaudio) without is_X_available() guards
  • Writes files to disk in __call__ (JSON metadata, MP4, WAV) — diffusers pipelines return outputs, they don't write files
  • Hardcodes developer paths and imports from the reference repo — per review rules, this is a parity testing harness, not a pipeline

This file belongs in scripts/ or the upstream repository, not in src/diffusers/.
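
To illustrate the kind of availability guard the second bullet asks for, here is a minimal sketch. It uses only the standard library as a stand-in; `is_package_available` and `require` are hypothetical names for this example and are not the diffusers `is_X_available()` helpers themselves.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if the top-level package `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def require(name: str) -> None:
    """Fail early with a readable message instead of a bare ImportError mid-call."""
    if not is_package_available(name):
        raise ImportError(f"This code path needs the optional package '{name}'. Please install it first.")


# Usage: guard the optional dependency before the code path that imports it.
if is_package_available("torchaudio"):
    backend = "torchaudio"
else:
    backend = None  # the caller can skip audio export or raise via require("torchaudio")
```

The point of the guard is that importing the module never fails just because an optional package is missing; only the code path that actually needs it does.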

3. Pipeline bypasses the scheduler

JoyAIEchoPipeline.__call__ implements its own denoising loop with explicit sigma scheduling (line 353–432), never calling self.scheduler.set_timesteps() or self.scheduler.step(). Per pipelines.md gotcha #3:

"Don't do this — bypassing the scheduler entirely and rolling your own step"

The scheduler is listed as _optional_components but is never used — it's dead code. The flow matching Euler step logic in _denoise_shot (velocity-to-x0 conversion at line 410–411, noise addition at line 418–419) duplicates what FlowMatchEulerDiscreteScheduler.step() already does. Either integrate with the scheduler or remove it entirely.
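
The overlap between the hand-rolled step and a scheduler step can be checked on scalars. The sketch below assumes the usual rectified-flow convention, x = (1 - sigma) * x0 + sigma * eps with velocity v = eps - x0; the PR text quoted here does not spell out its convention, so treat that as an assumption. All function names are hypothetical and stand in for the pipeline's private helpers.

```python
import random


def velocity_to_x0(x: float, v: float, sigma: float) -> float:
    """Recover the clean-sample estimate under the assumed convention."""
    return x - sigma * v


def renoise(x0: float, eps: float, next_sigma: float) -> float:
    """Re-noise an x0 estimate to the next noise level (the hand-rolled path)."""
    return (1.0 - next_sigma) * x0 + next_sigma * eps


def euler_step(x: float, v: float, sigma: float, next_sigma: float) -> float:
    """Plain flow-matching Euler update, the kind of step a scheduler would own."""
    return x + (next_sigma - sigma) * v


rng = random.Random(0)
x0_true, eps = rng.gauss(0, 1), rng.gauss(0, 1)
sigma, next_sigma = 0.9, 0.5

x = (1.0 - sigma) * x0_true + sigma * eps  # noisy sample at level sigma
v = eps - x0_true                          # ideal velocity prediction

x0_hat = velocity_to_x0(x, v, sigma)
manual = renoise(x0_hat, eps, next_sigma)  # reuses the SAME noise sample
stepped = euler_step(x, v, sigma, next_sigma)
```

When the same noise is reused, `manual` and `stepped` agree up to floating-point error. With freshly drawn noise at each step, which is what the `randn_tensor` calls in the loop suggest, the manual path becomes a stochastic variant rather than the deterministic Euler update, so whether a particular scheduler configuration reproduces it would need to be verified against the scheduler's source.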

4. _build_video_memory_attention_mask is defined but never called

pipeline_joyai_echo.py:206 defines _build_video_memory_attention_mask, but it's never invoked anywhere. In _denoise_shot, video_attention_mask is initialized as None and stays None through the entire loop. If memory masks are needed for correctness, they should be built and passed; if not, the method should be removed.
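
For readers unfamiliar with what such a mask could look like, here is a purely illustrative sketch. The PR excerpt does not show what `_build_video_memory_attention_mask` computes, so the layout (memory tokens first, then target tokens) and the policy (memory attends only to memory, targets attend to everything) are assumptions, and `build_memory_attention_mask` is a hypothetical name.

```python
def build_memory_attention_mask(num_memory: int, num_target: int) -> list[list[bool]]:
    """Return mask[q][k], True where query token q may attend to key token k.

    Assumed layout: memory tokens come first, target tokens follow.
    Assumed policy: memory queries see only memory keys; target queries see all keys.
    """
    total = num_memory + num_target
    mask = []
    for q in range(total):
        q_is_memory = q < num_memory
        # A key is visible if it is a memory key, or if the query is a target token.
        row = [(k < num_memory) or not q_is_memory for k in range(total)]
        mask.append(row)
    return mask


mask = build_memory_attention_mask(num_memory=2, num_target=3)
```

With no mask at all, every row would be all True, which is the behaviour the pipeline currently has.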

5. Missing check_inputs

The pipeline's __call__ performs no input validation — no height/width divisibility check, no prompt type validation, no validation of denoising_sigmas shape beyond 1D check. Other diffusers pipelines (including the parent LTX2Pipeline) validate inputs thoroughly. At minimum, validate height/width and prompt type.
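
A minimal sketch of the validation requested above, written as a standalone function so it can be read in isolation. The divisibility factor of 32 is an assumption for illustration (the usage example's 736×1280 happens to satisfy it); the real constraint should come from the VAE and transformer configuration.

```python
def check_inputs(prompt, height: int, width: int, spatial_multiple: int = 32) -> None:
    """Raise ValueError on malformed arguments; return None when they look usable."""
    if isinstance(prompt, str):
        prompts = [prompt]
    elif isinstance(prompt, (list, tuple)):
        prompts = list(prompt)
    else:
        raise ValueError(f"`prompt` must be a str or a list of str, got {type(prompt).__name__}.")

    # Multi-shot generation takes one prompt per shot, so an empty list is an error too.
    if not prompts or not all(isinstance(p, str) for p in prompts):
        raise ValueError("`prompt` must contain at least one string (one per shot).")

    # spatial_multiple=32 is an assumed value, not taken from the PR.
    if height % spatial_multiple or width % spatial_multiple:
        raise ValueError(
            f"`height` and `width` must be divisible by {spatial_multiple}, got {height}x{width}."
        )


check_inputs(["shot one", "shot two"], height=736, width=1280)  # passes silently
```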


Non-blocking Issues

6. Mutable default argument

# pipeline_joyai_echo.py:457
callback_on_step_end_tensor_inputs: list[str] = ["latents", "audio_latents"]

Mutable default arguments are a well-known Python antipattern. Use None and set the default inside the method body.
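
The hazard is easy to reproduce outside the pipeline. In the sketch below, `collect_buggy` and `collect_fixed` are hypothetical helpers; the pipeline may never mutate its default list today, but nothing prevents a later change from doing so.

```python
from __future__ import annotations


def collect_buggy(extra: str, tensor_inputs: list[str] = ["latents", "audio_latents"]) -> list[str]:
    # The default list is created once, at function definition time, and shared by every call.
    tensor_inputs.append(extra)
    return tensor_inputs


def collect_fixed(extra: str, tensor_inputs: list[str] | None = None) -> list[str]:
    # A None sentinel gives each call its own fresh default.
    if tensor_inputs is None:
        tensor_inputs = ["latents", "audio_latents"]
    return [*tensor_inputs, extra]


first = collect_buggy("prompt_embeds")
second = collect_buggy("prompt_embeds")  # same list object as `first`, now with a duplicate entry

clean_a = collect_fixed("prompt_embeds")
clean_b = collect_fixed("prompt_embeds")  # independent of clean_a
```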

7. batch_size = 1 is hardcoded

In __call__ (line 497), batch_size is always 1, meaning the pipeline doesn't support batch generation. This should be documented in the docstring, or batch support should be added.

8. Custom _get_execution_device is unnecessary

The override at line 253–266 duplicates what DiffusionPipeline._execution_device already provides. The try/except is odd — _execution_device is a property that always exists. Remove this override and use self._execution_device directly.

9. JoyAIEchoMemoryBank and JoyAIEchoMemorySlot export inconsistency

These classes are exported from pipelines/joyai_echo/__init__.py but NOT from diffusers/pipelines/__init__.py or diffusers/__init__.py. Either export them consistently or keep them private (they're internal implementation details).

10. Missing # Copied from annotations

_decode_latents borrows heavily from LTX2Pipeline decode logic but has no # Copied from annotation. Per pipelines.md gotcha #5, this means future refactors to the source will diverge silently.

11. Test coverage is minimal

Only one test (test_multishot_memory_prefix) exists, checking token counts and output shapes. No single-shot test, no error case tests, no test for output_type="latent". The test uses fake components that bypass actual computation — while this is fine for a fast test, there should be more coverage.


Suggestions / Additional Info: Dead Code Trace

I traced the call path from JoyAIEchoPipeline.__call__ → _denoise_shot → self.transformer(...) and identified the following likely dead code:

| Code | Status | Reason |
| --- | --- | --- |
| JoyAIEchoTransformer3DModel.forward() mask params | Dead | Pipeline never passes audio_self_attention_mask, a2v_cross_attention_mask, v2a_cross_attention_mask |
| _build_video_memory_attention_mask | Dead | Defined but never called anywhere |
| self.scheduler | Dead | Listed as optional component, never used in __call__ |
| _add_flow_noise | Used | Called in _denoise_shot |
| _repeat_token_timestep | Used | Called in _denoise_shot |
| _select_memory_video_tokens | Used | Called in __call__ |
| _select_memory_audio_tokens | Used | Called in __call__ |
| JoyAIEchoOriginalCheckpointPipeline | Should remove | Parity testing harness, not a diffusers pipeline |

Note: The transformer config could in theory differ from defaults, but under the default call path from this pipeline, none of the memory mask parameters are used.


Summary

The core idea of a multi-shot memory-conditioned audio-video pipeline is sound and aligns well with the LTX2 ecosystem. However, several structural issues need attention before merge:

  1. The transformer subclass duplicates the parent's forward and is never exercised — either wire up the masks or remove the subclass
  2. The original checkpoint pipeline is a parity harness, not a diffusers pipeline — remove it
  3. The scheduler is bypassed — integrate with the scheduler or remove it
  4. Add input validation (check_inputs)
  5. Remove dead code (_build_video_memory_attention_mask)

@sergereview sergereview Bot left a comment

🤗 Serge says:

Initial Review — JoyAI-Echo Pipeline

This PR has several significant structural and correctness issues that need to be addressed before it can be merged.

Architecture: Pipeline subclassing violates repo conventions

JoyAIEchoPipeline subclasses LTX2Pipeline — this directly violates the convention in pipelines.md gotcha #4: "Don't use an existing pipeline class to override another inside the core src/ codebase. Each pipeline lives in its own file with its own class." The pipeline should be a standalone class that reuses shared utilities (prep functions, etc.) but not the pipeline class itself.

Correctness: Scheduler is bypassed entirely

_denoise_shot rolls its own Euler step via _add_flow_noise and never calls self.scheduler.step(). This is exactly what pipelines.md gotcha #3 warns against: "Reimplementing what the scheduler already does." The scheduler is registered as an optional component but never used — the denoising loop should use the scheduler.

Correctness: Memory attention masks are never built or passed

The entire justification for JoyAIEchoTransformer3DModel (subclassing LTX2VideoTransformer3DModel) is to pass through audio_self_attention_mask, a2v_cross_attention_mask, and v2a_cross_attention_mask. However:

  • The pipeline never passes these three mask arguments to the transformer.
  • _build_video_memory_attention_mask is defined but never called — it's dead code.
  • video_attention_mask in _denoise_shot is always None (set on line 361, never reassigned).

This means the transformer subclass adds no value over the parent, and the memory bank only contributes latent tokens (via concatenation) without any structural attention masking. Either the masks need to be built and passed, or the subclass should be removed.

Correctness: Transformer subclass is a near-verbatim copy of the parent

The 287-line JoyAIEchoTransformer3DModel.forward() is a copy-paste of LTX2VideoTransformer3DModel.forward() with the only difference being that it passes through the mask parameters instead of hardcoding None. Per AGENTS.md: "No defensive code, unused code paths". If the masks are needed, a cleaner approach would be to make the parent class accept them (it already does at the block level). If they're not needed, delete the subclass.

JoyAIEchoOriginalCheckpointPipeline should not be shipped

This class manipulates sys.path to import from the original research repo (ltx-core/src, ltx-pipelines/src, ltx-distillation/src), has hardcoded developer paths, and is essentially a parity harness. Per review rules, parity harnesses and comparison scripts with imports from the reference repo are ephemeral context that shouldn't be in the final PR.

Dead code and unnecessary overrides

  • _get_execution_device() is redundant — _execution_device is a @property on DiffusionPipeline and will never raise AttributeError. The entire method body after return self._execution_device is dead code.
  • _build_video_memory_attention_mask is never called.
  • _as_prompt_list is a trivial one-liner that could be inlined.

Tests

  • Only one test (test_multishot_memory_prefix) with dummy components that bypass all real logic. The test doesn't exercise the actual JoyAIEchoTransformer3DModel at all.
  • tests/pipelines/joyai_echo/__init__.py has imports and __all__ — test __init__.py files in this repo are empty.
  • No check_inputs validation method on the pipeline.

Missing model class attributes

While JoyAIEchoTransformer3DModel inherits _no_split_modules, _repeated_blocks, etc. from the parent, it should be verified that these are correct for the subclass (e.g., _repeated_blocks references LTX2VideoTransformerBlock which is fine since the subclass doesn't add new block types).

Suggestions / Dead Code Analysis

Under the default config and the pipeline's call path:

  • _build_video_memory_attention_mask is unreachable (never called).
  • The audio_self_attention_mask, a2v_cross_attention_mask, v2a_cross_attention_mask parameters in the transformer's forward are always None because the pipeline never passes them.
  • The video_attention_mask variable in _denoise_shot is always None.

model: claude-opus-4-6 · 35 LLM turns · 42 tool calls · 226.0s · 2117185 in / 8068 out tokens

return audio_latents[:, start:end].contiguous(), audio_coords[:, :, start:end].contiguous()


class JoyAIEchoPipeline(LTX2Pipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):

Subclassing LTX2Pipeline violates the repo convention in pipelines.md gotcha #4: "Don't use an existing pipeline class to override another inside the core src/ codebase. Each pipeline lives in its own file with its own class, even if it shares 90% of __call__ with a sibling."

This should be a standalone pipeline class inheriting from DiffusionPipeline (+ mixins), reusing shared utilities but not the pipeline class itself.

audio_noise = randn_tensor(
    audio_latents.shape, generator=generator, device=device, dtype=audio_latents.dtype
)
latents = self._add_flow_noise(pred_video, video_noise, next_sigma).to(dtype=dtype)

The denoising loop bypasses the scheduler entirely and rolls its own Euler step via _add_flow_noise. This is exactly what pipelines.md gotcha #3 warns against:

"don't do this — bypassing the scheduler entirely and rolling your own step"

The scheduler should own the step logic. If the DMD sigma schedule requires a specific scheduler configuration, configure the scheduler accordingly rather than reimplementing the step.

audio_model_input = audio_latents
video_model_coords = video_coords
audio_model_coords = audio_coords
video_attention_mask = None

video_attention_mask is set to None here and never reassigned, even when memory tokens are prepended. This means video_self_attention_mask is always None in the transformer call (line 403). The _build_video_memory_attention_mask method (line 206) is defined but never called — it's dead code.

Without attention masks, memory tokens attend freely to target tokens and vice versa with no structural constraint. Is this intentional? If so, the _build_video_memory_attention_mask method and the JoyAIEchoTransformer3DModel subclass (which exists solely to pass through mask parameters) should be removed.

logger = logging.get_logger(__name__)


class JoyAIEchoTransformer3DModel(LTX2VideoTransformer3DModel):

This 287-line forward is a near-verbatim copy of LTX2VideoTransformer3DModel.forward(). The only difference is passing through audio_self_attention_mask, a2v_cross_attention_mask, and v2a_cross_attention_mask instead of hardcoding None. But the pipeline never passes these masks — they're always None.

Since the block-level forward already accepts these parameters, a cleaner approach (if the masks are actually needed) would be to modify the parent's forward to accept and pass them through, rather than duplicating the entire method. If the masks aren't needed, this subclass should be deleted entirely and the pipeline should use LTX2VideoTransformer3DModel directly.

from ..pipeline_utils import DiffusionPipeline


class JoyAIEchoOriginalCheckpointPipeline(DiffusionPipeline):

This class manipulates sys.path to import from the original research repo (ltx-core/src, ltx-pipelines/src, ltx-distillation/src), has hardcoded developer paths, and is essentially a parity harness. Per the review rules, parity harnesses and comparison scripts with imports from the reference repo are ephemeral context and should not be shipped in the final PR. This file should be removed.


return self.connectors(prompt_embeds, prompt_attention_mask, padding_side=tokenizer_padding_side)

def _get_execution_device(self) -> torch.device:

This method is unnecessary. _execution_device is a @property on DiffusionPipeline — it will never raise AttributeError, so the except branch and the fallback loop are dead code. The entire method reduces to return self._execution_device. Just use self._execution_device directly at the call site (line 496) and delete this method.

return sigma

@staticmethod
def _build_video_memory_attention_mask(

Dead code — this method is defined but never called anywhere in the pipeline.

@@ -0,0 +1,4 @@
from .test_joyai_echo import JoyAIEchoPipelineFastTests

Test __init__.py files in this repo are empty (see tests/pipelines/ltx2/__init__.py and others). Remove the imports and __all__.

Suggested change
- from .test_joyai_echo import JoyAIEchoPipelineFastTests

transformer_outputs_x0: bool = True,
attention_kwargs: dict[str, Any] | None = None,
callback_on_step_end: Callable[[Any, int, torch.Tensor, dict], dict] | None = None,
callback_on_step_end_tensor_inputs: list[str] = ["latents", "audio_latents"],

Mutable default argument. This should be None with a default set inside the method body:

Suggested change
- callback_on_step_end_tensor_inputs: list[str] = ["latents", "audio_latents"],
+ callback_on_step_end_tensor_inputs: list[str] | None = None,

Then inside the method: callback_on_step_end_tensor_inputs = callback_on_step_end_tensor_inputs or ["latents", "audio_latents"]

if callback_on_step_end is not None:
    callback_kwargs = {}
    for name in callback_on_step_end_tensor_inputs or []:
        callback_kwargs[name] = locals()[name]

Using locals() to populate callback kwargs is fragile — if a variable is renamed or shadowed, this silently breaks. Prefer explicit dict construction:

callback_kwargs = {}
for name in callback_on_step_end_tensor_inputs or []:
    if name == "latents":
        callback_kwargs[name] = latents
    elif name == "audio_latents":
        callback_kwargs[name] = audio_latents
    elif name == "prompt_embeds":
        callback_kwargs[name] = prompt_embeds
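
A slightly more compact variant of the same idea is sketched below, assuming the set of exposable tensors is known at the call site. `build_callback_kwargs` is a hypothetical helper, not an existing diffusers function; unlike the if/elif chain, it rejects unknown names loudly instead of skipping them.

```python
def build_callback_kwargs(requested: list[str], available: dict[str, object]) -> dict[str, object]:
    """Pick the requested entries from an explicit mapping, failing on unknown names."""
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(
            f"Unsupported callback tensor inputs {unknown}; choose from {sorted(available)}."
        )
    return {name: available[name] for name in requested}


# Usage inside a denoising loop would pass the live tensors; placeholders are used here.
kwargs = build_callback_kwargs(
    ["latents"],
    {"latents": "video-latents", "audio_latents": "audio-latents", "prompt_embeds": "embeds"},
)
```

Because the mapping is written out explicitly, renaming a local variable produces an immediate NameError at the call site rather than a silent mismatch.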

Labels

documentation Improvements or additions to documentation models pipelines size/L PR with diff > 200 LOC tests

Development

Successfully merging this pull request may close these issues.

[New Pipeline/Model] Add JoyAI-Echo multi-shot audio-video generation pipeline

2 participants