HunyuanVideo Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

HunyuanVideo

HunyuanVideo: A Systematic Framework For Large Video Generation Model

HunyuanVideo Overview and System Architecture

Related topics: Inference Workflows and Deployment Modes, Core Model Components and Diffusion Pipeline

Purpose and Scope

HunyuanVideo is an open-source text-to-video (T2V) generation system released by Tencent under the Tencent Hunyuan Community License Agreement (Source: LICENSE.txt:1-25). The repository provides the model weights, inference code, and configuration utilities needed to run high-resolution video synthesis on multi-GPU setups. As of this writing, the public release centers on the T2V checkpoint family, while the image-to-video (I2V) variant is on the public roadmap — a recurring point of community interest tracked in issues #128, #131, #172, #180, and #198.

The repository is organized so that pretrained weights are stored under HunyuanVideo/ckpts/ in a strict directory layout containing hunyuan-video-t2v-720p/transformers, vae, text_encoder, and text_encoder_2 (Source: ckpts/README.md:1-10). A second community-supported MLLM path, llava-llama-3-8b-v1_1-transformers, can be preprocessed into the text_encoder directory to save GPU memory (Source: ckpts/README.md:35-50).
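
The layout described above can be checked before launching inference. The following is a minimal sketch, assuming the DiT and VAE directories sit under hunyuan-video-t2v-720p/ and the two text encoders sit directly under the checkpoint root; the helper name missing_checkpoint_dirs is hypothetical and not part of the repository.

```python
from pathlib import Path

# Assumed layout, taken from the description above; adjust if your tree differs.
REQUIRED_DIRS = (
    "hunyuan-video-t2v-720p/transformers",
    "hunyuan-video-t2v-720p/vae",
    "text_encoder",
    "text_encoder_2",
)


def missing_checkpoint_dirs(model_base):
    """Return the required subdirectories that do not exist under model_base."""
    root = Path(model_base)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoint_dirs("ckpts")
    if missing:
        print("Missing checkpoint directories:", ", ".join(missing))
    else:
        print("Checkpoint layout looks complete.")
```

Running the check first gives a readable list of absent directories rather than a failure partway through model loading.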

System Architecture

The inference stack is a multi-stage latent diffusion pipeline. The user prompt is first optionally rewritten by an LLM in hyvideo/prompt_rewrite.py (Normal or Master mode), then encoded by two frozen text encoders, diffused in latent space by a DiT-style transformer, and finally decoded back to pixel space by a causal 3D VAE. A flow-matching scheduler governs the denoising trajectory.

flowchart LR
    A[User Prompt] --> B[Prompt Rewriter<br/>hyvideo/prompt_rewrite.py]
    B --> C[Text Encoder 1<br/>MLLM / LLaMA]
    A --> D[Text Encoder 2<br/>CLIP ViT-L/14]
    C --> E[HYVideoDiffusionTransformer<br/>hyvideo/modules/models.py]
    D --> E
    E --> F[FlowMatchDiscreteScheduler<br/>scheduling_flow_match_discrete.py]
    F --> G[Causal 3D VAE<br/>autoencoder_kl_causal_3d.py]
    G --> H[Output Video Frames]

The pipeline entry point is HunyuanVideoPipeline, a subclass of DiffusionPipeline that wires together a VAE, two text encoders, a HYVideoDiffusionTransformer, and a KarrasDiffusionSchedulers-compatible scheduler (Source: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:10-30). It defines an explicit model_cpu_offload_seq of "text_encoder->text_encoder_2->transformer->vae" and lists the transformer in _exclude_from_cpu_offload, so the transformer stays on the GPU while the other components may be moved to the CPU between uses (Source: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:32-35).

Key Components

DiT Backbone

HYVideoDiffusionTransformer is registered via @register_to_config and constructed by load_model in hyvideo/modules/__init__.py, which dispatches on the args.model key against the HUNYUAN_VIDEO_CONFIG registry (Source: hyvideo/modules/__init__.py:1-20). The default inference model is HYVideo-T/2-cfgdistill (Source: hyvideo/config.py:30-40). The transformer is composed of MMDoubleStreamBlock and MMSingleStreamBlock modules, mirroring the SD3 / Flux design where text and video tokens receive separate modulation before being fused (Source: hyvideo/modules/models.py:18-80). A SingleTokenRefiner block refines text token embeddings before they enter the joint attention layers.

Text Encoding and Prompt Templates

Two text encoders are loaded. The first is a decoder-only MLLM (default llava-llama-3-8b-v1_1-transformers), which requires a Llama-3 chat template to be applied via PROMPT_TEMPLATE_ENCODE (Source: hyvideo/constants.py:20-40). A dedicated PROMPT_TEMPLATE_ENCODE_VIDEO template instructs the same MLLM to describe videos across five aspects — content/theme, color/shape/spatial relations, actions/temporal changes, environment/style, and camera angles (Source: hyvideo/constants.py:28-50). The second encoder is openai/clip-vit-large-patch14, which provides an alternative CLIP embedding (Source: ckpts/README.md:25-30). A shared negative prompt is hard-coded: "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion" (Source: hyvideo/constants.py:50-55).

Causal 3D VAE

The decoder is a causal 3D VAE (AutoencoderKLCausal3D) modified from diffusers==0.29.2. Its imports fall back between the patched and the upstream diffusers.loaders API (Source: hyvideo/vae/autoencoder_kl_causal_3d.py:15-30). "Causal" means that each latent frame depends only on the current and earlier input frames, never on later ones, which preserves temporal ordering across the generated video.

Flow-Matching Scheduler

Sampling is driven by FlowMatchDiscreteScheduler, derived from Stability AI's flow-matching implementation and adapted from diffusers==0.29.2 (Source: hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py:1-30). It exposes a step method that returns a FlowMatchDiscreteSchedulerOutput with prev_sample for iterative denoising.

Configuration and Usage Patterns

Argument parsing is centralized in hyvideo/config.py, which groups CLI flags into network, extra-models, denoise-schedule, inference, and parallel sections. The network group exposes --model, --latent-channels, --precision (fp32/fp16/bf16, default bf16), and --rope-theta (Source: hyvideo/config.py:25-55). Supported precisions, normalization types, and activation types are whitelisted as PRECISIONS, NORMALIZATION_TYPE, and ACTIVATION_TYPE sets in hyvideo/constants.py (Source: hyvideo/constants.py:1-20). The --flow-reverse flag belongs to the denoise-schedule group, while the --ulysses-degree / --ring-degree flags used in community-reported multi-GPU recipes (e.g. issue #249) are defined in the parallel group.

A typical inference command is torchrun --nproc_per_node=N sample_video.py --video-size 1280 720 --video-length 129 --infer-steps 50 --prompt "..." --flow-reverse --seed 42 --ulysses-degree N (Source: hyvideo/inference.py and issue #249). This shows the pipeline is designed for distributed execution: a torchrun launcher combined with sequence-parallel strategies (Ulysses, Ring) that split the token sequence processed by HYVideoDiffusionTransformer across devices.

Community Engagement and Roadmap

The community consistently raises three themes: (1) the I2V release date — tracked in issues #128, #131, #172, #180, and #198; (2) the absence of an official fine-tuning script, flagged in issue #302; and (3) installation/parallel-inference pitfalls such as the CUDA 12.4 / cuBLAS 12.4.5.8 / cuDNN 9.0 requirement discussed in issue #317. Users encountering multi-GPU hangs typically need to verify these driver versions before assuming a code defect.

Inference Workflows and Deployment Modes

Related topics: HunyuanVideo Overview and System Architecture, Core Model Components and Diffusion Pipeline, Community Roadmap, Troubleshooting, and Known Issues

Overview

HunyuanVideo ships multiple entry points that share a common inference core defined in hyvideo/inference.py (the HunyuanVideoSampler class). That core orchestrates three subsystems (the text encoders, a multimodal DiT denoiser, and a causal 3D VAE) and drives them with a flow-matching discrete scheduler from hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py. Around this core, the repository exposes four deployment modes: a CLI single-GPU runner (sample_video.py), a multi-GPU sequence-parallel runner launched with torchrun, an FP8-quantized low-VRAM variant, and a Gradio web UI (gradio_server.py).

The shared pipeline is implemented in hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py and accepts pre-computed prompt_embeds/negative_prompt_embeds, raw latents, a generator, and guidance_rescale — the same surface used by every entry point below.

Single-GPU CLI Inference

The canonical entry point is sample_video.py, which parses command-line arguments, builds a HunyuanVideoSampler, and calls its predict(...) method. Default arguments are declared in hyvideo/config.py, where key flags include --model-base (root of the ckpts/ tree), --dit-weight (defaults to ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt), --model-resolution (540p or 720p), --use-cpu-offload, and --load-key (module or ema). Resolution is also tied to model selection: sample_video.py automatically picks the matching DiT checkpoint and VAE for the requested model-resolution.

The reference launcher scripts/run_sample_video.sh invokes:

python sample_video.py \
    --prompt "A cat walks on the grass, realistic style." \
    --video-size 1280 720 --video-length 129 --infer-steps 50 \
    --flow-reverse --seed 42 --ulysses-degree 1 --ring-degree 1

Inside HunyuanVideoSampler.predict, the prompt is routed through TextEncoder (hyvideo/text_encoder/__init__.py), which selects PROMPT_TEMPLATE["dit-llm-encode-video"] from hyvideo/constants.py and applies the LLaMA-style chat template (PROMPT_TEMPLATE_ENCODE_VIDEO) before tokenization. Negative prompts default to the string in hyvideo/constants.py (NEGATIVE_PROMPT = "Aerial view, aerial view, overexposed, low quality, ..."). The resulting embeddings, together with flow_shift, guidance_scale, and embedded_guidance_scale, are passed to self.pipeline(...), whose data_type argument is automatically set to "video" when target_video_length > 1 else "image".

Multi-GPU and Sequence-Parallel Inference

For multi-GPU runs the project provides scripts/run_sample_video_multigpu.sh, which wraps the same script under torchrun:

torchrun --nproc_per_node=$NGPU sample_video.py \
    --video-size 1280 720 --video-length 129 --infer-steps 50 \
    --prompt "..." --flow-reverse --seed 42 \
    --ulysses-degree $ULYSSES_DEGREE --ring-degree $RING_DEGREE

The --ulysses-degree and --ring-degree flags enable DeepSpeed Ulysses sequence parallelism and ring attention respectively; the model enables them by calling parallel_attention from hyvideo/modules/attenion.py when those degrees are greater than one. The DiT block in hyvideo/modules/models.py exposes MMDoubleStreamBlock and MMSingleStreamBlock layers (20 + 40 by default) that participate in distributed attention.

This is the same code path users hit in community issue #249, where parallel inference failed mid-run. The most common root causes reported there are (a) mismatched CUDA / cuBLAS versions (the project requires nvidia-cublas-cu12==12.4.5.8 plus LD_LIBRARY_PATH pointed at the conda cublas libs, or the bundled CUDA 12 Docker image; see issue #317), and (b) --ulysses-degree / --ring-degree not evenly dividing the head count. The model has heads_num=24 (see the HYVideoDiffusionTransformer constructor in hyvideo/modules/models.py), so the product ulysses-degree × ring-degree must divide 24.
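
The divisibility rule can be explored with a short helper. This is a sketch under two assumptions: that the product of the two degrees must divide the head count of 24 stated above, and that the product equals the number of launched processes (as in the issue #249 command, where 2 processes pair with degrees 2 and 1). The function name valid_degree_pairs is hypothetical.

```python
HEADS_NUM = 24  # head count stated above


def valid_degree_pairs(num_gpus, heads_num=HEADS_NUM):
    """List (ulysses_degree, ring_degree) pairs whose product is num_gpus.

    Returns an empty list when num_gpus does not divide heads_num.
    Assumption: the product of the degrees equals the process count.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be positive")
    if heads_num % num_gpus != 0:
        return []
    return [
        (ulysses, num_gpus // ulysses)
        for ulysses in range(1, num_gpus + 1)
        if num_gpus % ulysses == 0
    ]


if __name__ == "__main__":
    for n in (1, 2, 4, 5, 8):
        print(n, valid_degree_pairs(n))
```

Under these assumptions a 5-GPU launch has no valid pairing, because 5 does not divide 24.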

FP8 Quantized Deployment

To reduce VRAM, the ckpts/ tree ships an FP8-quantized checkpoint in addition to the bf16 weights. The launcher scripts/run_sample_video_fp8.sh activates it with:

--dit-weight ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt \
--dit-weight-map ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8_map.pt

The sampler reads the FP8 weight together with its scale map, dequantizes on the fly, and reuses the same pipeline as the bf16 path — there is no separate model architecture for the FP8 variant. Per ckpts/README.md, both formats live under ckpts/hunyuan-video-t2v-720p/transformers/ alongside the VAE and the dual text encoders (text_encoder, text_encoder_2).

Gradio Web UI

gradio_server.py provides a browser interface that wraps HunyuanVideoSampler. It exposes resolution presets (1280x720, 720x1280, 1104x832, etc.), inference steps, guidance scale, embedded guidance scale, flow shift, and a negative-prompt textbox. On submit it calls infer(...) with num_videos_per_prompt=1, batch_size=1, and writes the resulting video via save_videos_grid to gradio_outputs/<timestamp>_seed<seed>_<prompt-slug>.mp4 at 24 fps. The default example prompt is "A cat walks on the grass, realistic style.".

End-to-End Data Flow

flowchart LR
    A[Prompt] --> B[TextEncoder\nLLaMA-style template]
    N[Negative Prompt] --> B
    B --> C[Prompt / Neg Embeddings]
    C --> D[DiT Denoiser\nFlow-Match Scheduler]
    Z[Random Latents] --> D
    D --> E[Clean Latents]
    E --> F[Causal 3D VAE\nDecode]
    F --> G[Video Frames / Image]
    G --> H[save_videos_grid → MP4]

Configuration Reference

Flag	Default	Purpose
`--model-base`	`ckpts`	Root of all model weights.
`--dit-weight`	`.../mp_rank_00_model_states.pt`	DiT checkpoint (bf16 or FP8).
`--model-resolution`	`540p`	Selects 540p or 720p DiT + VAE.
`--use-cpu-offload`	off	Offload DiT/VAE/TextEncoder to CPU.
`--load-key`	`module`	`module` (weights) or `ema` (EMA copy).
`--ulysses-degree`	`1`	Ulysses sequence-parallel degree.
`--ring-degree`	`1`	Ring-attention degree.
`--flow-shift`	per resolution	Shifts the flow-matching sigma schedule.
`--infer-steps`	—	Number of denoising iterations.
`--video-length` / `--video-size`	—	Output frame count and `(H, W)`.

Common Failure Modes

cuBLAS / CUDA mismatch — Issue #317 documents the need for nvidia-cublas-cu12==12.4.5.8 and a matching LD_LIBRARY_PATH, or the official CUDA 12 Docker image.
Parallel-inference crash on multi-GPU — Issue #249 traces the failure to improper setup of --ulysses-degree / --ring-degree; both must be 1 for single-GPU runs.
Missing checkpoints — ckpts/README.md requires huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts; the sampler raises if text_encoder, text_encoder_2, vae, or the DiT weights cannot be resolved under --model-base.

Core Model Components and Diffusion Pipeline

Related topics: HunyuanVideo Overview and System Architecture, Inference Workflows and Deployment Modes

Overview

The HunyuanVideo repository implements a text-to-video (T2V) generation system whose center of gravity is a multimodal Diffusion Transformer (DiT) combined with a causal 3D VAE, dual text encoders, and a flow-matching Euler scheduler wrapped in a Diffusers-style pipeline. The model definition lives in hyvideo/modules/models.py, the inference orchestration in hyvideo/inference.py, and the diffusion loop in hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py. The design intentionally mirrors SD3 and Flux.1 and adds a 3D-aware video latent path; sampling uses a flow-matching discrete scheduler rather than a classical DDPM schedule.

The pipeline expects four major sub-networks to be loaded from ckpts/ (see ckpts/README.md): the DiT transformer (hunyuan-video-t2v-720p/transformers/), the causal 3D VAE (hunyuan-video-t2v-720p/vae/), and two text encoders (text_encoder for the MLLM, text_encoder_2 for CLIP-L). A high-level view of the runtime data flow is shown below.

flowchart LR
    P[Prompt] --> TE1[CLIP-L Text Encoder]
    P --> TE2[LLM Text Encoder + Refiner]
    TE1 --> Proj[Text Projection]
    TE2 --> Proj
    Proj --> DiT[HYVideoDiffusionTransformer]
    N[Random Noise] --> Sched[Flow-Match Scheduler]
    Sched --> DiT
    DiT --> Latent[Video Latent]
    Latent --> VAE[Causal 3D VAE Decoder]
    VAE --> Out[Output Video / Image]

Source: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:18-37, hyvideo/inference.py:53-145.

Core DiT Model

The HYVideoDiffusionTransformer defined in hyvideo/modules/models.py is a dual-stream then single-stream DiT. The constructor accepts mm_double_blocks_depth=20 and mm_single_blocks_depth=40 by default, producing 20 multimodal double-stream blocks followed by 40 single-stream blocks. Key initialization arguments are summarized below.

Argument	Default	Role
`patch_size`	`[1, 2, 2]`	3D patch size: 1 along temporal axis, 2×2 spatially
`in_channels`	`4`	Latent channels from the VAE
`hidden_size`	`3072`	Transformer hidden width
`heads_num`	`24`	Attention heads
`mlp_width_ratio`	`4.0`	MLP expansion factor
`rope_dim_list`	`[16, 56, 56]`	RoPE split across T, H, W axes
`text_projection`	`"single_refiner"`	Token refiner configuration for text features
`guidance_embed`	`False`	Reserved for distillation guidance
`use_attention_mask`	`True`	Pad-mask text tokens during attention

Source: hyvideo/modules/models.py:80-110.
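
The defaults in the table are related by simple arithmetic, which the sketch below works through. It assumes only the table values; the latent shape (33, 90, 160) is a hypothetical example, and whether the repository enforces the RoPE check at construction time is not confirmed here.

```python
# Defaults copied from the table above.
hidden_size = 3072
heads_num = 24
rope_dim_list = [16, 56, 56]
patch_size = [1, 2, 2]

# Width of each attention head.
head_dim = hidden_size // heads_num  # 3072 / 24 = 128


def rope_dims_match(head_dim, rope_dim_list):
    """True when the per-axis RoPE widths (T, H, W) add up to the head width."""
    return sum(rope_dim_list) == head_dim


def num_tokens(latent_thw, patch_size):
    """Number of video tokens for a latent of shape (T, H, W) after 3D patching."""
    t, h, w = latent_thw
    pt, ph, pw = patch_size
    if t % pt or h % ph or w % pw:
        raise ValueError("latent shape must be divisible by the patch size")
    return (t // pt) * (h // ph) * (w // pw)


if __name__ == "__main__":
    print(head_dim, rope_dims_match(head_dim, rope_dim_list))
    # Hypothetical latent shape, used only to illustrate the arithmetic.
    print(num_tokens((33, 90, 160), patch_size))
```

With these defaults, 16 + 56 + 56 equals the 128-wide head, and the hypothetical latent yields 33 × 45 × 80 = 118800 tokens, which illustrates why sequence length dominates attention cost for long videos.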

MMDoubleStreamBlock

MMDoubleStreamBlock runs two parallel streams — one for visual tokens and one for text tokens — and only lets them interact through a joint attention operation. The visual path applies img_mod modulation; the text path applies txt_mod. Both projections produce QKV plus an MLP gate (mlp_in), and the class explicitly cites SD3 (arXiv:2403.03206) and Flux.1 as design references. RoPE is applied through apply_rotary_emb from hyvideo/modules/posemb_layers.py using the per-axis rope_dim_list.

Source: hyvideo/modules/models.py:21-78.

MMSingleStreamBlock

MMSingleStreamBlock collapses the two streams into one. It uses a fused linear1 projection that emits 3 * hidden_size + mlp_hidden_dim channels (QKV + MLP input in a single matmul) and a fused linear2 that combines attention output and MLP output. This pattern is the same "parallel linear layers" trick used in arXiv:2302.05442. QK normalization (qk_norm=True, qk_norm_type="rms") is applied before scaled-dot-product attention, with qk_scale = head_dim ** -0.5.

Source: hyvideo/modules/models.py:130-185.
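
The fused projection widths follow from the defaults listed earlier. The sketch below assumes mlp_hidden_dim = hidden_size × mlp_width_ratio and that the fused output is laid out as q, k, v, then the MLP input; both the ordering and the helper name split_fused are illustrative assumptions, not confirmed details of the repository.

```python
hidden_size = 3072
mlp_width_ratio = 4.0

# Assumed definition of the MLP hidden width.
mlp_hidden_dim = int(hidden_size * mlp_width_ratio)  # 12288

# linear1 emits QKV and the MLP input in a single matmul.
linear1_out = 3 * hidden_size + mlp_hidden_dim  # 9216 + 12288 = 21504


def split_fused(row, hidden_size=hidden_size, mlp_hidden_dim=mlp_hidden_dim):
    """Split one fused linear1 output row into q, k, v and MLP-input slices.

    The q/k/v/mlp ordering is an assumption made for illustration.
    """
    if len(row) != 3 * hidden_size + mlp_hidden_dim:
        raise ValueError("row has unexpected width")
    q = row[:hidden_size]
    k = row[hidden_size:2 * hidden_size]
    v = row[2 * hidden_size:3 * hidden_size]
    mlp_in = row[3 * hidden_size:]
    return q, k, v, mlp_in


if __name__ == "__main__":
    q, k, v, mlp_in = split_fused(list(range(linear1_out)))
    print(linear1_out, len(q), len(k), len(v), len(mlp_in))
```

Fusing the projections replaces four separate matmuls with one wide matmul followed by cheap slicing.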

Diffusion Pipeline and Scheduler

The pipeline is a thin wrapper around Diffusers' DiffusionPipeline, with the offload sequence declared as "text_encoder->text_encoder_2->transformer->vae" and the transformer explicitly excluded from CPU offload (_exclude_from_cpu_offload = ["transformer"]). Optional components include text_encoder_2 (the CLIP encoder is optional in this construction). The pipeline accepts prompts, negative prompts, latent noise, guidance scale, and an embedded_guidance_scale that is forwarded into the denoising loop.

Source: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:18-65, hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:80-120.

The scheduler is a custom flow-matching discrete implementation living in hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py. In step(), the previous sample is computed as prev_sample = sample + model_output.to(torch.float32) * dt where dt = sigmas[i+1] - sigmas[i], with the sample being upcast to float32 to avoid precision loss. Only the "euler" solver is supported; any other solver raises ValueError. Integer timestep inputs are explicitly rejected with a clear error message.

Source: hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py:140-185.
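
The Euler update described above can be sketched without any dependencies. This is an illustration of the formula only, not the repository's scheduler: it uses plain Python lists instead of tensors, omits flow shift and internal step-index state, and the evenly spaced linear_sigmas schedule is a hypothetical stand-in.

```python
def euler_step(sample, model_output, sigmas, i):
    """One Euler update: x_next = x + v * (sigmas[i + 1] - sigmas[i])."""
    if not 0 <= i < len(sigmas) - 1:
        raise IndexError("step index out of range")
    dt = sigmas[i + 1] - sigmas[i]
    return [x + float(v) * dt for x, v in zip(sample, model_output)]


def linear_sigmas(num_steps):
    """Hypothetical schedule: evenly spaced from 1.0 (noise) down to 0.0 (data)."""
    return [1.0 - k / num_steps for k in range(num_steps + 1)]


if __name__ == "__main__":
    sigmas = linear_sigmas(4)
    sample = [1.0, 2.0]
    velocity = [0.5, -1.0]  # stand-in for a constant model prediction
    for i in range(len(sigmas) - 1):
        sample = euler_step(sample, velocity, sigmas, i)
    print(sample)
```

Because sigma decreases from 1 to 0, dt is negative at every step; with a constant velocity v the sample moves by a total of -v over the full trajectory.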

HunyuanVideoSampler.predict_noise_per_step and the surrounding inference_from_validation / inference_from_captions methods wire the pipeline together: they build the text encoders, compute freqs_cis (the RoPE cos/sin tables), and call self.pipeline(...) with output_type="pil". The argument data_type="video" if target_video_length > 1 else "image" lets the same pipeline produce either a single image or a video. This switch only selects the output type; it does not add image conditioning, so the community requests for an image-to-video (I2V) checkpoint (e.g. #128, #131, #172, #180, #198) remain open while the public release ships only the T2V transformer.

Source: hyvideo/inference.py:60-160, community issues #128, #131, #172, #180, #198; hyvideo/config.py:80-140.

Supporting Components

Text Encoders

Two text encoders are wired up in hyvideo/text_encoder/__init__.py. clipL uses HuggingFace's CLIPTextModel, while llm loads a decoder-only language model via AutoModel. The LLM's norm is aliased to final_layer_norm so that downstream code can call .final_layer_norm(x) uniformly across encoders. The encoders are always switched to eval() mode and frozen (requires_grad_(False)). The dual-encoder setup is reflected in PROMPT_TEMPLATE constants that include both a generic image description template (dit-llm-encode) and a video-aware template (dit-llm-encode-video) with corresponding crop_start offsets (36 and 95 respectively).

Source: hyvideo/text_encoder/__init__.py:18-75, hyvideo/constants.py:30-60.

For community users who want to fine-tune, a one-off utility, preprocess_text_encoder_tokenizer_utils.py, extracts the language model and tokenizer from a LLaVA checkpoint into a pure text-encoder checkpoint compatible with the llm loader above. Together with the --use-cpu-offload and --load-key module|ema flags exposed in hyvideo/config.py, this inference-side tooling is the closest the repository currently gets to a fine-tuning starting point; community issue #302 explicitly requests a more complete fine-tuning script.

Source: hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py:8-30, community issue #302.

Causal 3D VAE

The VAE is AutoencoderKLCausal3D, a Diagonal-Gaussian autoencoder with causal 3D convolutions defined in hyvideo/vae/autoencoder_kl_causal_3d.py. It is registered as a Diffusers ModelMixin/ConfigMixin, supports gradient checkpointing, and yields (s_ratio, t_ratio) that the pipeline uses to snap requested output sizes to the VAE's spatial/temporal stride. The vae_tiling flag in hyvideo/config.py enables tiled decoding to fit long videos into limited VRAM.

Source: hyvideo/vae/autoencoder_kl_causal_3d.py:18-55, hyvideo/inference.py:35-50.

Configuration Surface

The CLI surface for inference is in hyvideo/config.py. Notable flags used by the pipeline are --infer-steps (default 50), --batch-size, --num-videos (videos per prompt), --video-size (single int or [H, W], default 720×1280), --video-length in frames, --save-path, and the precision switches (--text-encoder-precision, --vae-precision, and --precision for the DiT). Parallel inference is enabled through external launchers (torchrun --nproc_per_node=N) and the --ulysses-degree / --ring-degree flags that activate sequence-parallel attention, a frequent source of community confusion as seen in issue #249.

Source: hyvideo/config.py:30-160, community issue #249.

Common Failure Modes

CUDA / cuBLAS mismatch: community issue #317 documents that PyTorch compiled against the wrong CUDA runtime triggers linker errors; the recommended fix is to install nvidia-cublas-cu12==12.4.5.8 and set LD_LIBRARY_PATH to the conda site-packages, or use the official CUDA 12 Docker image.
Parallel inference hangs: issue #249 reports that torchrun --nproc_per_node=2 fails to start when the sequence-parallel groups are misconfigured; HunyuanVideoSampler in hyvideo/inference.py must be launched with --ulysses-degree and --ring-degree values that match the process count.
Out-of-memory on long videos: the VAE can be tiled (--vae-tiling) and the transformer is excluded from CPU offload; if OOM persists, reduce --video-length and --video-size in that order.
I2V / fine-tuning scripts: the open source release does not yet ship image-to-video weights (#128, #131, #172, #180, #198) or an official fine-tuning script (#302); users currently adapt inference.py and the text-encoder preprocess utility as a starting point.

Community Roadmap, Troubleshooting, and Known Issues

Related topics: HunyuanVideo Overview and System Architecture, Inference Workflows and Deployment Modes

This page consolidates information that is most relevant to users interacting with the public HunyuanVideo repository: the scope of what the open-source release currently supports, the recurring requests and reported issues observed in community discussions, the constraints imposed by the project's license, and the most common failure modes encountered during installation and inference. It is intended as an orientation guide for new users and a reference for contributors triaging issues.

Repository Scope and What the Codebase Supports

HunyuanVideo is shipped as a text-to-video (T2V) diffusion system built around a multimodal DiT backbone and a causal 3D VAE. The HYVideoDiffusionTransformer class defined in hyvideo/modules/models.py exposes configuration arguments such as mm_double_blocks_depth, mm_single_blocks_depth, heads_num, hidden_size, rope_dim_list, and text_projection; these are the architectural knobs available to users who build on the released code. The DiT combines MMDoubleStreamBlock (separate text and video modulation, as in SD3/Flux) and MMSingleStreamBlock (fused parallel linear layers, as in arXiv:2302.05442).

The pipeline assembly, denoising loop, and CFG handling live in hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py, which subclasses DiffusionPipeline and integrates AutoencoderKL together with the FlowMatchDiscreteScheduler defined in hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py. The scheduler's step method performs an Euler update (prev_sample = sample + model_output.to(torch.float32) * dt), which is the only solver currently implemented — the else branch raises ValueError(f"Solver {self.config.solver} not supported.").

The text encoder wrapper in hyvideo/text_encoder/__init__.py supports both string templates (for LLM-style encoders) and chat-template lists via tokenizer.apply_chat_template. Prompt-format constants, including the negative prompt used by default, are defined in hyvideo/constants.py:

Constant	Purpose
`PROMPT_TEMPLATE_ENCODE`	System prompt for image description when using the LLM encoder
`PROMPT_TEMPLATE_ENCODE_VIDEO`	Multi-aspect system prompt for video description (theme, objects, actions, environment, camera)
`NEGATIVE_PROMPT`	Default CFG negative prompt (filters aerial view, overexposure, deformation, bad hands/teeth/limbs)
`PRECISIONS`	Allowed dtypes: `fp32`, `fp16`, `bf16`
`C_SCALE`	1e15, used as a PetaFLOPS scaler for tensorboard logging

Community Roadmap Signals

The most heavily engaged GitHub issues in this repository are not bug reports — they are requests for an Image-to-Video (I2V) checkpoint and inference script. Issues #180, #128, #172, #198, and #131 all ask, in English and Chinese, when an I2V model will be released. As of the current repository state, only the T2V checkpoint is downloadable per the instructions in ckpts/README.md, which documents hunyuan-video-t2v-720p (with transformers/mp_rank_00_model_states.pt and FP8 variants) under ckpts/. No I2V checkpoint directory is documented there.

The second recurring theme is fine-tuning. Issue #302 explicitly asks for "Official Fine-Tuning Code / Training Example." The shipped source tree contains the model definition (HYVideoDiffusionTransformer in models.py), the forward-pass pipeline, and inference-side utilities, but it does not include a training loop, a LoRA adapter implementation, or a dataset pipeline. Users seeking fine-tuning must therefore implement the training infrastructure themselves; the codebase provides the building blocks but not the trainer.

Known Issues and Common Failure Modes

Off-Topic Issue Volume

A substantial fraction of issues opened against the repository are not technical requests — they are creative-writing prompts (e.g., #318, #313, #311, #308, #292, #282) where users paste screenplay-style text and request generated videos. These issues do not reflect bugs and are not actionable by maintainers. Contributors triaging the tracker should close them as not planned or invalid.

CUDA / cuBLAS Version Mismatch

Issue #317 documents the canonical CUDA environment problem. The recommended fix published in the issue thread is:

# Option 1: install the matching cuBLAS and point LD_LIBRARY_PATH at it
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

# Option 2: use the official CUDA 12 Docker image
# Option 3: pin PyTorch and all CUDA-dependent libs to the CUDA 11.8 build

Users should verify their CUDA toolkit, cuBLAS (>=12.4.5.8), and cuDNN (>=9.00) versions before reporting new issues.
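
The cuBLAS minimum can also be checked from Python before filing an issue. This is a minimal sketch, assuming cuBLAS was installed as the pip package nvidia-cublas-cu12 named above; a system-wide CUDA toolkit install would not be visible this way. The helper names are hypothetical.

```python
from importlib import metadata

MIN_CUBLAS = (12, 4, 5, 8)  # minimum version stated above


def parse_version(text):
    """Turn '12.4.5.8' into (12, 4, 5, 8); non-numeric parts raise ValueError."""
    return tuple(int(part) for part in text.strip().split("."))


def cublas_ok(installed, minimum=MIN_CUBLAS):
    """True when the installed version string is at least the minimum."""
    return parse_version(installed) >= minimum


def installed_cublas_version():
    """Return the pip-installed nvidia-cublas-cu12 version string, or None."""
    try:
        return metadata.version("nvidia-cublas-cu12")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_cublas_version()
    if found is None:
        print("nvidia-cublas-cu12 is not installed via pip")
    else:
        print(found, "OK" if cublas_ok(found) else "too old")
```

Tuple comparison handles multi-part versions correctly, which a plain string comparison would not (for example "12.10" sorts before "12.4" as text).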

Parallel Inference Failures

Issue #249 reports a failure when launching multi-GPU inference with:

torchrun --nproc_per_node=2 sample_video.py \
    --video-size 1280 720 --video-length 129 --infer-steps 50 \
    --prompt "astronaut is fixing the space station." \
    --flow-reverse --seed 42 --ulysses-degree 2 --ring-degree 1

The Ulysses + Ring sequence-parallel configuration requires the launcher's process count to match the degree flags: in the command above, --nproc_per_node=2 corresponds to --ulysses-degree 2 and --ring-degree 1. When these are mismatched, or when the distributed environment variables that torchrun normally sets (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are missing, the process group fails to initialize. To run without sequence parallelism, set both degrees to 1. The shipped scheduler (FlowMatchDiscreteScheduler.step) operates on per-rank tensors and assumes the parallel attention has already produced a full-sequence view, so initialization must succeed before the first step() call.

Prompt Template Truncation

The PROMPT_TEMPLATE dictionary in hyvideo/constants.py uses two crop_start values: 36 for the image template and 95 for the video template. If a user passes a custom prompt template, they must recompute crop_start to match the token length of the system-prompt prefix; otherwise system-prompt tokens remain in the encoded prompt and bias the generation away from the user's intent. The crop is applied inside the text encoder wrapper's encode method.
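
The relationship between a template prefix and crop_start can be sketched as follows. Everything here is illustrative: the "{}" placeholder, the whitespace tokenizer, and both helper names are assumptions, and a real LLM tokenizer will produce different counts (it may also add special tokens), so crop_start must be measured with the tokenizer actually in use.

```python
def prefix_length(template, tokenize, placeholder="{}"):
    """Token count of the template text that precedes the user-prompt placeholder."""
    if placeholder not in template:
        raise ValueError("template has no placeholder for the user prompt")
    prefix = template.split(placeholder, 1)[0]
    return len(tokenize(prefix))


def crop_prompt_states(hidden_states, attention_mask, crop_start):
    """Drop the first crop_start positions (the template prefix) from both sequences."""
    if not 0 <= crop_start <= len(hidden_states):
        raise ValueError("crop_start out of range")
    return hidden_states[crop_start:], attention_mask[crop_start:]


if __name__ == "__main__":
    # Hypothetical template and whitespace tokenizer, for illustration only.
    template = "system: describe the video . user: {}"
    crop_start = prefix_length(template, str.split)
    states = list(range(10))  # stand-in for per-token hidden states
    mask = [1] * 10
    print(crop_start, crop_prompt_states(states, mask, crop_start))
```

If crop_start is smaller than the real prefix length, trailing system-prompt positions survive the crop; if it is larger, the first user-prompt tokens are lost.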

Scheduler Step-Index State

FlowMatchDiscreteScheduler.step does not accept integer indices; it expects the torch.FloatTensor values taken from scheduler.timesteps. Passing the loop index from enumerate(timesteps) instead of the timestep value raises ValueError per the explicit guard in scheduling_flow_match_discrete.py. Users porting code from diffusers' EulerDiscreteScheduler often trip this guard.
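
The pitfall is easiest to see in a loop. The sketch below imitates the guard with a hypothetical helper, require_float_timestep, that rejects plain Python integers only; the timestep values are made up, and the repository's own check operates on tensors and may cover more cases.

```python
def require_float_timestep(timestep):
    """Reject plain Python ints, imitating the guard described above (sketch only)."""
    if isinstance(timestep, int):
        raise ValueError("Pass a value from scheduler.timesteps, not a loop index.")
    return timestep


if __name__ == "__main__":
    timesteps = [1000.0, 750.0, 500.0, 250.0]  # hypothetical float timesteps
    for index, t in enumerate(timesteps):
        require_float_timestep(t)  # correct: the timestep value
        # require_float_timestep(index) would raise: it is the integer index
    print("all timesteps accepted")
```

The fix when porting a loop is to pass the second element of each enumerate pair to step() and keep the index only for bookkeeping.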

Licensing Constraints Affecting Deployment

The Tencent Hunyuan Community License Agreement (LICENSE.txt) imposes several restrictions that users should understand before deploying:

flowchart LR
    A[Use HunyuanVideo] --> B{Territory}
    B -- EU/UK/South Korea --> X[Not Licensed]
    B -- Worldwide ex. above --> C{Check Section 5 Rules}
    C --> D[No impersonation]
    C --> E[No high-stakes automation]
    C --> F[No military use]
    C --> G[No discrimination]
    C --> H[No violence/terrorism]
    D & E & F & G & H --> I[Compliant Deployment]
    A --> J{MAU > 100M?}
    J -- Yes --> K[Must request commercial license]
    J -- No --> I

Additional commercial trigger (LICENSE.txt): "If, on the Tencent Hunyuan version release date, the monthly active users of all products or services made available by or for Licensee is greater than 100 million monthly active users in the preceding calendar month, You must request a license from Tencent." Any redistribution must include the attribution string beginning with "Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright © 2024 Tencent."

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 12 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/249

2. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/302

3. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: identity.distribution | https://github.com/Tencent-Hunyuan/HunyuanVideo

4. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/317

5. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/Tencent-Hunyuan/HunyuanVideo

6. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/311

7. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/313

8. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/Tencent-Hunyuan/HunyuanVideo

9. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/Tencent-Hunyuan/HunyuanVideo

10. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/Tencent-Hunyuan/HunyuanVideo

11. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/Tencent-Hunyuan/HunyuanVideo

12. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/Tencent-Hunyuan/HunyuanVideo

Source: Doramagic discovery, validation, and Project Pack records
