Doramagic Project Pack · Human Manual
HunyuanVideo
HunyuanVideo: A Systematic Framework For Large Video Generation Model
HunyuanVideo Overview and System Architecture
Related topics: Inference Workflows and Deployment Modes, Core Model Components and Diffusion Pipeline
Purpose and Scope
HunyuanVideo is an open-source text-to-video (T2V) generation system released by Tencent under the Tencent Hunyuan Community License Agreement (Source: LICENSE.txt:1-25). The repository provides the model weights, inference code, and configuration utilities needed to run high-resolution video synthesis on multi-GPU setups. As of this writing, the public release centers on the T2V checkpoint family, while the image-to-video (I2V) variant is on the public roadmap — a recurring point of community interest tracked in issues #128, #131, #172, #180, and #198.
The repository is organized so that pretrained weights are stored under HunyuanVideo/ckpts/ in a strict directory layout containing hunyuan-video-t2v-720p/transformers, vae, text_encoder, and text_encoder_2 (Source: ckpts/README.md:1-10). A second community-supported MLLM path, llava-llama-3-8b-v1_1-transformers, can be preprocessed into the text_encoder directory to save GPU memory (Source: ckpts/README.md:35-50).
System Architecture
The inference stack is a multi-stage latent diffusion pipeline. The user prompt is first optionally rewritten by an LLM in hyvideo/prompt_rewrite.py (Normal or Master mode), then encoded by two frozen text encoders, diffused in latent space by a DiT-style transformer, and finally decoded back to pixel space by a causal 3D VAE. A flow-matching scheduler governs the denoising trajectory.
flowchart LR
A[User Prompt] --> B[Prompt Rewriter<br/>hyvideo/prompt_rewrite.py]
B --> C[Text Encoder 1<br/>MLLM / LLaMA]
A --> D[Text Encoder 2<br/>CLIP ViT-L/14]
C --> E[HYVideoDiffusionTransformer<br/>hyvideo/modules/models.py]
D --> E
E --> F[FlowMatchDiscreteScheduler<br/>scheduling_flow_match_discrete.py]
F --> G[Causal 3D VAE<br/>autoencoder_kl_causal_3d.py]
G --> H[Output Video Frames]
The pipeline entry point is HunyuanVideoPipeline, a subclass of DiffusionPipeline that wires together a VAE, two text encoders, a HYVideoDiffusionTransformer, and a KarrasDiffusionSchedulers-compatible scheduler (Source: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:10-30). It defines an explicit model_cpu_offload_seq of "text_encoder->text_encoder_2->transformer->vae" and excludes the transformer from CPU offload, indicating that the transformer is the most memory-hungry component (Source: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:32-35).
Key Components
DiT Backbone
HYVideoDiffusionTransformer is registered via @register_to_config and constructed by load_model in hyvideo/modules/__init__.py, which dispatches on the args.model key against the HUNYUAN_VIDEO_CONFIG registry (Source: hyvideo/modules/__init__.py:1-20). The default inference model is HYVideo-T/2-cfgdistill (Source: hyvideo/config.py:30-40). The transformer is composed of MMDoubleStreamBlock and MMSingleStreamBlock modules, mirroring the SD3 / Flux design where text and video tokens receive separate modulation before being fused (Source: hyvideo/modules/models.py:18-80). A SingleTokenRefiner block refines text token embeddings before they enter the joint attention layers.
Text Encoding and Prompt Templates
Two text encoders are loaded. The first is a decoder-only MLLM (default llava-llama-3-8b-v1_1-transformers), which requires a Llama-3 chat template to be applied via PROMPT_TEMPLATE_ENCODE (Source: hyvideo/constants.py:20-40). A dedicated PROMPT_TEMPLATE_ENCODE_VIDEO template instructs the same MLLM to describe videos across five aspects — content/theme, color/shape/spatial relations, actions/temporal changes, environment/style, and camera angles (Source: hyvideo/constants.py:28-50). The second encoder is openai/clip-vit-large-patch14, which provides an alternative CLIP embedding (Source: ckpts/README.md:25-30). A shared negative prompt is hard-coded: "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion" (Source: hyvideo/constants.py:50-55).
Causal 3D VAE
The decoder is a 3D causal VAE (AutoencoderKLCausal3D) modified from diffusers==0.29.2, supporting a fall-back import path against either the patched or upstream diffusers.loaders API (Source: hyvideo/vae/autoencoder_kl_causal_3d.py:15-30). The "causal" design preserves temporal causality across video frames, which is essential for coherent T2V outputs.
Flow-Matching Scheduler
Sampling is driven by FlowMatchDiscreteScheduler, derived from Stability AI's flow-matching implementation and adapted from diffusers==0.29.2 (Source: hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py:1-30). It exposes a step method that returns a FlowMatchDiscreteSchedulerOutput with prev_sample for iterative denoising.
Configuration and Usage Patterns
Argument parsing is centralized in hyvideo/config.py, which groups CLI flags into network, extra-models, denoise-schedule, inference, and parallel sections. The network group exposes --model, --latent-channels, --precision (fp32/fp16/bf16, default bf16), and --rope-theta (Source: hyvideo/config.py:25-55). Supported precisions, normalization types, and activation types are whitelisted as PRECISIONS, NORMALIZATION_TYPE, and ACTIVATION_TYPE sets in hyvideo/constants.py (Source: hyvideo/constants.py:1-20). The --flow-reverse flag belongs to the denoise-schedule group, while the --ulysses-degree / --ring-degree flags used in community-reported multi-GPU recipes (e.g. issue #249) are defined in the parallel group.
A typical inference command is torchrun --nproc_per_node=N sample_video.py --video-size 1280 720 --video-length 129 --infer-steps 50 --prompt "..." --flow-reverse --seed 42 --ulysses-degree N (Source: hyvideo/inference.py and issue #249). This shows the pipeline is designed for distributed execution: a torchrun launcher combined with sequence-parallel strategies (Ulysses, Ring) sharding the HYVideoDiffusionTransformer across devices.
Community Engagement and Roadmap
The community consistently raises three themes: (1) the I2V release date — tracked in issues #128, #131, #172, #180, and #198; (2) the absence of an official fine-tuning script, flagged in issue #302; and (3) installation/parallel-inference pitfalls such as the CUDA 12.4 / cuBLAS 12.4.5.8 / cuDNN 9.0 requirement discussed in issue #317. Users encountering multi-GPU hangs typically need to verify these driver versions before assuming a code defect.
See Also
- Pretrained Model Download Guide
- Pipeline Reference
- HYVideoDiffusionTransformer Architecture
- Flow-Matching Scheduler Details
- 3D Causal VAE
Source: https://github.com/Tencent-Hunyuan/HunyuanVideo / Human Manual
Inference Workflows and Deployment Modes
Related topics: HunyuanVideo Overview and System Architecture, Core Model Components and Diffusion Pipeline, Community Roadmap, Troubleshooting, and Known Issues
Overview
HunyuanVideo ships multiple entry points that share a common inference core defined in hyvideo/inference.py (the HunyuanVideoSampler class). That core orchestrates three subsystems — a text encoder, a Multimodal DiT denoiser, and a causal 3D VAE — and drives them with a flow-matching discrete scheduler from hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py. Around this core, the repository exposes four deployment modes: a CLI single-GPU runner (sample_video.py), a multi-GPU sequence-parallel runner launched with torchrun, an FP8-quantized low-VRAM variant, and a Gradio web UI (gradio_server.py).
The shared pipeline is implemented in hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py and accepts pre-computed prompt_embeds/negative_prompt_embeds, raw latents, a generator, and guidance_rescale — the same surface used by every entry point below.
Single-GPU CLI Inference
The canonical entry point is sample_video.py, which parses command-line arguments, builds a HunyuanVideoSampler, and calls its predict(...) method. Default arguments are declared in hyvideo/config.py, where key flags include --model-base (root of the ckpts/ tree), --dit-weight (defaults to ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt), --model-resolution (540p or 720p), --use-cpu-offload, and --load-key (module or ema). Resolution is also tied to model selection: sample_video.py automatically picks the matching DiT checkpoint and VAE for the requested model-resolution.
The reference launcher scripts/run_sample_video.sh invokes:
python sample_video.py \
--prompt "A cat walks on the grass, realistic style." \
--video-size 1280 720 --video-length 129 --infer-steps 50 \
--flow-reverse --seed 42 --ulysses-degree 1 --ring-degree 1
Inside HunyuanVideoSampler.predict, the prompt is routed through TextEncoder (hyvideo/text_encoder/__init__.py), which selects PROMPT_TEMPLATE["dit-llm-encode-video"] from hyvideo/constants.py and applies the LLaMA-style chat template (PROMPT_TEMPLATE_ENCODE_VIDEO) before tokenization. Negative prompts default to the string in hyvideo/constants.py (NEGATIVE_PROMPT = "Aerial view, aerial view, overexposed, low quality, ..."). The resulting embeddings, together with flow_shift, guidance_scale, and embedded_guidance_scale, are passed to self.pipeline(...), whose data_type argument is set automatically: "video" when target_video_length > 1, and "image" otherwise.
Multi-GPU and Sequence-Parallel Inference
For multi-GPU runs the project provides scripts/run_sample_video_multigpu.sh, which wraps the same script under torchrun:
torchrun --nproc_per_node=$NGPU sample_video.py \
--video-size 1280 720 --video-length 129 --infer-steps 50 \
--prompt "..." --flow-reverse --seed 42 \
--ulysses-degree $ULYSSES_DEGREE --ring-degree $RING_DEGREE
The --ulysses-degree and --ring-degree flags enable DeepSpeed Ulysses sequence parallelism and ring attention respectively; the model enables them by calling parallel_attention from hyvideo/modules/attenion.py when those degrees are greater than one. The DiT block in hyvideo/modules/models.py exposes MMDoubleStreamBlock and MMSingleStreamBlock layers (20 + 40 by default) that participate in distributed attention.
This is the same code path users hit in community issue #249, where parallel inference failed mid-run. The most common root causes reported there are (a) mismatched CUDA / cuBLAS versions (the project requires nvidia-cublas-cu12==12.4.5.8 plus LD_LIBRARY_PATH pointed at the conda cublas libs, or the bundled CUDA 12 Docker image — see issue #317), and (b) --ulysses-degree / --ring-degree not evenly dividing the head count. The model has heads_num=24 (see HunyuanVideo constructor in hyvideo/modules/models.py), so valid ulysses-degree * ring-degree combinations must divide 24.
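The constraints described above can be checked before launching a run. The following is a minimal sketch, not repository code: check_parallel_config is a hypothetical helper, the rule that the process count equals ulysses-degree * ring-degree is an assumption drawn from the sample commands, and the divisibility rule simply restates the claim in the text.

```python
from typing import List

HEADS_NUM = 24  # heads_num quoted in this manual for the released DiT


def check_parallel_config(nproc: int, ulysses_degree: int, ring_degree: int,
                          heads_num: int = HEADS_NUM) -> List[str]:
    """Return human-readable problems; an empty list means the config looks consistent."""
    problems: List[str] = []
    if ulysses_degree < 1 or ring_degree < 1:
        problems.append("degrees must be >= 1")
        return problems
    product = ulysses_degree * ring_degree
    # Assumption: one process per sequence-parallel rank.
    if product != nproc:
        problems.append(f"ulysses*ring = {product}, but nproc = {nproc}")
    # Constraint as stated in the text: the product must divide the head count.
    if heads_num % product != 0:
        problems.append(f"{product} does not divide heads_num = {heads_num}")
    return problems


print(check_parallel_config(nproc=2, ulysses_degree=2, ring_degree=1))
print(check_parallel_config(nproc=8, ulysses_degree=5, ring_degree=1))
```

The first call returns an empty list; the second reports both a process-count mismatch and a head-count divisibility problem. Running such a check in a launcher script can surface a bad degree combination before torchrun spawns any workers.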
FP8 Quantized Deployment
To reduce VRAM usage, the ckpts/ tree ships an FP8-quantized checkpoint in addition to the bf16 weights. The launcher scripts/run_sample_video_fp8.sh activates it with:
--dit-weight ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt \
--dit-weight-map ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8_map.pt
The sampler reads the FP8 weight together with its scale map, dequantizes on the fly, and reuses the same pipeline as the bf16 path — there is no separate model architecture for the FP8 variant. Per ckpts/README.md, both formats live under ckpts/hunyuan-video-t2v-720p/transformers/ alongside the VAE and the dual text encoders (text_encoder, text_encoder_2).
Gradio Web UI
gradio_server.py provides a browser interface that wraps HunyuanVideoSampler. It exposes resolution presets (1280x720, 720x1280, 1104x832, etc.), inference steps, guidance scale, embedded guidance scale, flow shift, and a negative-prompt textbox. On submit it calls infer(...) with num_videos_per_prompt=1, batch_size=1, and writes the resulting video via save_videos_grid to gradio_outputs/<timestamp>_seed<seed>_<prompt-slug>.mp4 at 24 fps. The default seed prompt is "A cat walks on the grass, realistic style.".
End-to-End Data Flow
flowchart LR
A[Prompt] --> B[TextEncoder\nLLaMA-style template]
N[Negative Prompt] --> B
B --> C[Prompt / Neg Embeddings]
C --> D[DiT Denoiser\nFlow-Match Scheduler]
Z[Random Latents] --> D
D --> E[Clean Latents]
E --> F[Causal 3D VAE\nDecode]
F --> G[Video Frames / Image]
G --> H[save_videos_grid → MP4]
Configuration Reference
| Flag | Default | Purpose |
|---|---|---|
| --model-base | ckpts | Root of all model weights. |
| --dit-weight | .../mp_rank_00_model_states.pt | DiT checkpoint (bf16 or FP8). |
| --model-resolution | 540p | Selects 540p or 720p DiT + VAE. |
| --use-cpu-offload | off | Offload DiT/VAE/TextEncoder to CPU. |
| --load-key | module | module (weights) or ema (EMA copy). |
| --ulysses-degree | 1 | Ulysses sequence-parallel degree. |
| --ring-degree | 1 | Ring-attention degree. |
| --flow-shift | per resolution | Shifts the flow-matching sigma schedule. |
| --infer-steps | not listed | Number of denoising iterations. |
| --video-length / --video-size | not listed | Output frame count and (H, W). |
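To make the table concrete, here is a minimal argparse sketch that mirrors a few of the listed flags and defaults. It is an illustration only: build_parser is a hypothetical function and is not the parser defined in hyvideo/config.py, which groups many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a few flags from the table above."""
    parser = argparse.ArgumentParser(description="Sketch of HunyuanVideo-style CLI flags")
    parser.add_argument("--model-base", type=str, default="ckpts")
    parser.add_argument("--model-resolution", type=str, default="540p",
                        choices=["540p", "720p"])
    parser.add_argument("--use-cpu-offload", action="store_true")
    parser.add_argument("--load-key", type=str, default="module",
                        choices=["module", "ema"])
    parser.add_argument("--ulysses-degree", type=int, default=1)
    parser.add_argument("--ring-degree", type=int, default=1)
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--model-resolution", "720p", "--ulysses-degree", "2"])
print(args.model_base, args.model_resolution, args.ulysses_degree, args.ring_degree)
```

Flags that are not passed fall back to the defaults from the table, so --ring-degree stays at 1 and --use-cpu-offload stays off in this call.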
Common Failure Modes
- cuBLAS / CUDA mismatch: Issue #317 documents the need for nvidia-cublas-cu12==12.4.5.8 and a matching LD_LIBRARY_PATH, or the official CUDA 12 Docker image.
- Parallel-inference crash on multi-GPU: Issue #249 traces the failure to improper setup of --ulysses-degree / --ring-degree; both must be 1 for single-GPU runs.
- Missing checkpoints: ckpts/README.md requires huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts; the sampler raises if text_encoder, text_encoder_2, vae, or the DiT weights cannot be resolved under --model-base.
See Also
- HunyuanVideo GitHub: <https://github.com/Tencent-Hunyuan/HunyuanVideo>
- Community thread on parallel inference: issue #249
- Community thread on CUDA/cuBLAS setup: issue #317
- Community thread on Image-to-Video model release roadmap: issues #128, #131, #180
Core Model Components and Diffusion Pipeline
Related topics: HunyuanVideo Overview and System Architecture, Inference Workflows and Deployment Modes
Overview
The HunyuanVideo repository implements a text-to-video (T2V) generation system whose center of gravity is a multimodal Diffusion Transformer (DiT) combined with a causal 3D VAE, dual text encoders, and a flow-matching Euler scheduler wrapped in a Diffusers-style pipeline. The model definition lives in hyvideo/modules/models.py, the inference orchestration in hyvideo/inference.py, and the diffusion loop in hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py. The design intentionally mirrors SD3 and Flux.1, but introduces a 3D-aware video latent path and a "flow-matching" discrete scheduler instead of classical DDPM.
The pipeline expects four major sub-networks to be loaded from ckpts/ (see ckpts/README.md): the DiT transformer (hunyuan-video-t2v-720p/transformers/), the causal 3D VAE (hunyuan-video-t2v-720p/vae/), and two text encoders (text_encoder for the MLLM, text_encoder_2 for CLIP-L). A high-level view of the runtime data flow is shown below.
flowchart LR
P[Prompt] --> TE1[LLM Text Encoder + Refiner]
P --> TE2[CLIP-L Text Encoder]
TE1 --> Proj[Text Projection]
TE2 --> Proj
Proj --> DiT[HYVideoDiffusionTransformer]
N[Random Noise] --> Sched[Flow-Match Scheduler]
Sched --> DiT
DiT --> Latent[Video Latent]
Latent --> VAE[Causal 3D VAE Decoder]
VAE --> Out[Output Video / Image]
Source: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:18-37, hyvideo/inference.py:53-145.
Core DiT Model
The HYVideoDiffusionTransformer defined in hyvideo/modules/models.py is a dual-stream then single-stream DiT. The constructor accepts mm_double_blocks_depth=20 and mm_single_blocks_depth=40 by default, producing 20 multimodal double-stream blocks followed by 40 single-stream blocks. Key initialization arguments are summarized below.
| Argument | Default | Role |
|---|---|---|
| patch_size | [1, 2, 2] | 3D patch size: 1 along temporal axis, 2×2 spatially |
| in_channels | 4 | Latent channels from the VAE |
| hidden_size | 3072 | Transformer hidden width |
| heads_num | 24 | Attention heads |
| mlp_width_ratio | 4.0 | MLP expansion factor |
| rope_dim_list | [16, 56, 56] | RoPE split across T, H, W axes |
| text_projection | "single_refiner" | Token refiner configuration for text features |
| guidance_embed | False | Reserved for distillation guidance |
| use_attention_mask | True | Pad-mask text tokens during attention |
Source: hyvideo/modules/models.py:80-110.
MMDoubleStreamBlock
MMDoubleStreamBlock runs two parallel streams — one for visual tokens and one for text tokens — and only lets them interact through a joint attention operation. The visual path applies img_mod modulation; the text path applies txt_mod. Both projections produce QKV plus an MLP gate (mlp_in), and the class explicitly cites SD3 (arXiv:2403.03206) and Flux.1 as design references. RoPE is applied through apply_rotary_emb from hyvideo/modules/posemb_layers.py using the per-axis rope_dim_list.
Source: hyvideo/modules/models.py:21-78.
MMSingleStreamBlock
MMSingleStreamBlock collapses the two streams into one. It uses a fused linear1 projection that emits 3 * hidden_size + mlp_hidden_dim channels (QKV + MLP input in a single matmul) and a fused linear2 that combines attention output and MLP output. This pattern is the same "parallel linear layers" trick used in arXiv:2302.05442. QK normalization (qk_norm=True, qk_norm_type="rms") is applied before scaled-dot-product attention, with qk_scale = head_dim ** -0.5.
Source: hyvideo/modules/models.py:130-185.
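The layer widths implied by the defaults quoted in this section can be worked out directly. The sketch below is plain arithmetic on those defaults and does not import the repository; the variable names follow the text, and the observation that the RoPE dimensions sum to the head dimension is a property of these particular numbers rather than a documented rule.

```python
# Defaults quoted in this section for HYVideoDiffusionTransformer.
hidden_size = 3072
heads_num = 24
mlp_width_ratio = 4.0
rope_dim_list = [16, 56, 56]

# Per-head width: the hidden size is split evenly across attention heads.
assert hidden_size % heads_num == 0
head_dim = hidden_size // heads_num

# MLP hidden width and the fused linear1 output (QKV + MLP input in one matmul).
mlp_hidden_dim = int(hidden_size * mlp_width_ratio)
linear1_out = 3 * hidden_size + mlp_hidden_dim

# Attention scaling factor described in the text.
qk_scale = head_dim ** -0.5

print("head_dim:", head_dim)
print("mlp_hidden_dim:", mlp_hidden_dim)
print("linear1_out:", linear1_out)
print("sum(rope_dim_list) == head_dim:", sum(rope_dim_list) == head_dim)
```

With these defaults, head_dim is 128, mlp_hidden_dim is 12288, and the fused projection emits 21504 channels; the three RoPE axis widths (16 + 56 + 56) also add up to 128.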
Diffusion Pipeline and Scheduler
The pipeline is a thin wrapper around Diffusers' DiffusionPipeline, with the offload sequence declared as "text_encoder->text_encoder_2->transformer->vae" and the transformer explicitly excluded from CPU offload (_exclude_from_cpu_offload = ["transformer"]). Optional components include text_encoder_2 (the CLIP encoder, which is optional in this construction). The pipeline accepts prompts, negative prompts, latent noise, guidance scale, and an embedded_guidance_scale that is forwarded into the denoising loop.
Source: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:18-65, hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py:80-120.
The scheduler is a custom flow-matching discrete implementation living in hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py. In step(), the previous sample is computed as prev_sample = sample + model_output.to(torch.float32) * dt where dt = sigmas[i+1] - sigmas[i], with the sample being upcast to float32 to avoid precision loss. Only the "euler" solver is supported; any other solver raises ValueError. Integer timestep inputs are explicitly rejected with a clear error message.
Source: hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py:140-185.
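The Euler update quoted above can be illustrated without the repository. This is a minimal sketch using plain Python floats in place of tensors: euler_flow_step and run_sampler are hypothetical names, and the toy velocity (noise minus data, constant along a straight path) is an assumption chosen so the result is easy to verify by hand.

```python
from typing import Callable, List


def euler_flow_step(sample: List[float], model_output: List[float],
                    sigmas: List[float], i: int) -> List[float]:
    """One Euler update: prev_sample = sample + model_output * dt."""
    dt = sigmas[i + 1] - sigmas[i]  # negative when sigmas decrease from 1 to 0
    return [x + v * dt for x, v in zip(sample, model_output)]


def run_sampler(sample: List[float],
                velocity_fn: Callable[[List[float], float], List[float]],
                sigmas: List[float]) -> List[float]:
    """Apply the Euler step across the whole sigma schedule."""
    for i in range(len(sigmas) - 1):
        sample = euler_flow_step(sample, velocity_fn(sample, sigmas[i]), sigmas, i)
    return sample


# Toy setup (assumption): a straight path x(sigma) = (1 - sigma) * data + sigma * noise,
# whose derivative with respect to sigma is the constant noise - data.
data = [2.0, -1.0]
noise = [0.5, 0.5]
velocity = [n - d for n, d in zip(noise, data)]
sigmas = [1.0, 0.75, 0.5, 0.25, 0.0]

result = run_sampler(noise, lambda x, s: velocity, sigmas)
print(result)
```

Because the toy velocity is constant, integrating from sigma = 1 down to sigma = 0 carries the sample from noise back to data. A learned model would predict a different velocity at every step, which is why the number of steps matters in practice.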
HunyuanVideoSampler.predict_noise_per_step and the surrounding inference_from_validation / inference_from_captions methods wire the pipeline together: they build the text encoders, compute freqs_cis (the RoPE cos/sin tables), and call self.pipeline(...) with output_type="pil". The argument data_type="video" if target_video_length > 1 else "image" lets the same pipeline drive both image and video outputs. This switch concerns the output type only: the public release ships only the T2V transformer, so community requests for an image-to-video (I2V) checkpoint (e.g. #128, #131, #172, #180, #198) remain open.
Source: hyvideo/inference.py:60-160, community issues #128, #131, #172, #180, #198; hyvideo/config.py:80-140.
Supporting Components
Text Encoders
Two text encoders are wired up in hyvideo/text_encoder/__init__.py. clipL uses HuggingFace's CLIPTextModel, while llm loads a decoder-only language model via AutoModel. The LLM's norm is aliased to final_layer_norm so that downstream code can call .final_layer_norm(x) uniformly across encoders. The encoders are always switched to eval() mode and frozen (requires_grad_(False)). The dual-encoder setup is reflected in PROMPT_TEMPLATE constants that include both a generic image description template (dit-llm-encode) and a video-aware template (dit-llm-encode-video) with corresponding crop_start offsets (36 and 95 respectively).
Source: hyvideo/text_encoder/__init__.py:18-75, hyvideo/constants.py:30-60.
For community users who want to fine-tune, a one-off utility preprocess_text_encoder_tokenizer_utils.py extracts the language-model head and tokenizer from a LLaVA checkpoint into a pure text-encoder checkpoint compatible with the llm loader above. Combined with the --use-cpu-offload and --load-key module|ema flags exposed in hyvideo/config.py, this is the closest the repository currently gets to a fine-tuning pathway — community issue #302 explicitly requests a more complete fine-tuning script.
Source: hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py:8-30, community issue #302.
Causal 3D VAE
The VAE is AutoencoderKLCausal3D, a Diagonal-Gaussian autoencoder with causal 3D convolutions defined in hyvideo/vae/autoencoder_kl_causal_3d.py. It is registered as a Diffusers ModelMixin/ConfigMixin, supports gradient checkpointing, and yields (s_ratio, t_ratio) that the pipeline uses to snap requested output sizes to the VAE's spatial/temporal stride. The vae_tiling flag in hyvideo/config.py enables tiled decoding to fit long videos into limited VRAM.
Source: hyvideo/vae/autoencoder_kl_causal_3d.py:18-55, hyvideo/inference.py:35-50.
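To show how the VAE strides relate a requested output size to the latent grid the DiT sees, here is a small worked example. It is a sketch under stated assumptions: the stride values t_ratio=4 and s_ratio=8 are not given in this manual and are assumed here, the "first frame plus groups of t_ratio frames" rule is the usual convention for a causal video VAE, and latent_shape and token_count are hypothetical helpers. Unlike the pipeline, which is described as snapping sizes, this sketch simply rejects sizes that do not fit.

```python
from typing import Tuple


def latent_shape(video_length: int, height: int, width: int,
                 t_ratio: int = 4, s_ratio: int = 8) -> Tuple[int, int, int]:
    """Map (frames, H, W) in pixel space to the latent grid; strides are assumed values."""
    if (video_length - 1) % t_ratio != 0:
        raise ValueError("video_length - 1 must be a multiple of t_ratio")
    if height % s_ratio != 0 or width % s_ratio != 0:
        raise ValueError("height and width must be multiples of s_ratio")
    # Causal convention (assumption): the first frame maps to its own latent frame.
    return ((video_length - 1) // t_ratio + 1, height // s_ratio, width // s_ratio)


def token_count(latent: Tuple[int, int, int],
                patch_size: Tuple[int, int, int] = (1, 2, 2)) -> int:
    """Number of video tokens after 3D patchification with the default patch_size."""
    t, h, w = latent
    pt, ph, pw = patch_size
    return (t // pt) * (h // ph) * (w // pw)


shape = latent_shape(video_length=129, height=720, width=1280)
print(shape, token_count(shape))
```

Under these assumptions a 129-frame 720×1280 request maps to a 33×90×160 latent grid and 118800 video tokens, which gives a sense of why long sequences motivate tiling and sequence parallelism.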
Configuration Surface
The CLI surface for inference is in hyvideo/config.py. Notable flags used by the pipeline are --infer-steps (default 50), --batch-size, --num-videos (videos per prompt), --video-size (single int or [H, W], default 720×1280), --video-length in frames, --save-path, and the precision switches (--text-encoder-precision, --vae-precision, --dit-precision). Parallel inference is enabled through external launchers (torchrun --nproc_per_node=N) and the --ulysses-degree / --ring-degree flags that activate sequence-parallel attention, a frequent source of community confusion as seen in issue #249.
Source: hyvideo/config.py:30-160, community issue #249.
Common Failure Modes
- CUDA / cuBLAS mismatch: community issue #317 documents that PyTorch compiled against the wrong CUDA runtime triggers linker errors; the recommended fix is to install nvidia-cublas-cu12==12.4.5.8 and set LD_LIBRARY_PATH to the conda site-packages, or use the official CUDA 12 Docker image.
- Parallel inference hangs: issue #249 reports that torchrun --nproc_per_node=2 fails to start when the sequence-parallel groups are misconfigured; the same inference.py HunyuanVideoSampler must be invoked with matching --ulysses-degree and --ring-degree.
- Out-of-memory on long videos: the VAE can be tiled (--vae-tiling) and the transformer is excluded from CPU offload; if OOM persists, reduce --video-length and --video-size in that order.
- I2V / fine-tuning scripts: the open source release does not yet ship image-to-video weights (#128, #131, #172, #180, #198) or an official fine-tuning script (#302); users currently adapt inference.py and the text-encoder preprocess utility as a starting point.
See Also
- Pretrained Checkpoints Download Guide
- HunyuanVideo inference command-line reference (hyvideo/config.py)
- Tencent Hunyuan Community License Agreement (LICENSE.txt) — note the Territory restriction (excluding EU, UK, and South Korea) and the 100M-MAU commercial clause.
Community Roadmap, Troubleshooting, and Known Issues
Related topics: HunyuanVideo Overview and System Architecture, Inference Workflows and Deployment Modes
This page consolidates the information most relevant to users of the public HunyuanVideo repository: the scope of what the open-source release currently supports, the recurring requests and reported issues observed in community discussions, the constraints imposed by the project's license, and the most common failure modes encountered during installation and inference. It is intended as an orientation guide for new users and a triage reference for contributors.
Repository Scope and What the Codebase Supports
HunyuanVideo is shipped as a text-to-video (T2V) diffusion system built around a multimodal DiT backbone and a causal 3D VAE. The HunyuanVideo class defined in hyvideo/modules/models.py exposes configuration flags such as mm_double_blocks_depth, mm_single_blocks_depth, heads_num, hidden_size, rope_dim_list, and text_projection; these are the architectural knobs available to users who build on the released code. The DiT combines MMDoubleStreamBlock (separate text and video modulation, as in SD3/Flux) and MMSingleStreamBlock (parallel linear layers, following arXiv:2302.05442).
The pipeline assembly, denoising loop, and CFG handling live in hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py, which subclasses DiffusionPipeline and integrates AutoencoderKL together with the FlowMatchDiscreteScheduler defined in hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py. The scheduler's step method performs an Euler update (prev_sample = sample + model_output.to(torch.float32) * dt), which is the only solver currently implemented — the else branch raises ValueError(f"Solver {self.config.solver} not supported.").
The text encoder wrapper in hyvideo/text_encoder/__init__.py supports both string templates (for LLM-style encoders) and chat-template lists via tokenizer.apply_chat_template. Prompt-format constants, including the negative prompt used by default, are defined in hyvideo/constants.py:
| Constant | Purpose |
|---|---|
| PROMPT_TEMPLATE_ENCODE | System prompt for image description when using the LLM encoder |
| PROMPT_TEMPLATE_ENCODE_VIDEO | Multi-aspect system prompt for video description (theme, objects, actions, environment, camera) |
| NEGATIVE_PROMPT | Default CFG negative prompt (filters aerial view, overexposure, deformation, bad hands/teeth/limbs) |
| PRECISIONS | Allowed dtypes: fp32, fp16, bf16 |
| C_SCALE | 1e15, used as a PetaFLOPS scaler for tensorboard logging |
Community Roadmap Signals
The most heavily engaged GitHub issues in this repository are not bug reports — they are requests for an Image-to-Video (I2V) checkpoint and inference script. Issues #180, #128, #172, #198, and #131 all ask, in English and Chinese, when an I2V model will be released. As of the current repository state, only the T2V checkpoint is downloadable per the instructions in ckpts/README.md, which documents hunyuan-video-t2v-720p (with transformers/mp_rank_00_model_states.pt and FP8 variants) under ckpts/. No I2V checkpoint directory is documented there.
The second recurring theme is fine-tuning. Issue #302 explicitly asks for "Official Fine-Tuning Code / Training Example." The shipped source tree contains the model definition (HunyuanVideo in models.py), the forward-pass pipeline, and inference-side utilities, but it does not include a training loop, a LoRA adapter implementation, or a dataset pipeline. Users seeking fine-tuning must therefore implement the training infrastructure themselves; the codebase provides the building blocks but not the trainer.
Known Issues and Common Failure Modes
Off-Topic Issue Volume
A substantial fraction of issues opened against the repository are not technical requests — they are creative-writing prompts (e.g., #318, #313, #311, #308, #292, #282) where users paste screenplay-style text and request generated videos. These issues do not reflect bugs and are not actionable by maintainers. Contributors triaging the tracker should close them as not planned or invalid.
CUDA / cuBLAS Version Mismatch
Issue #317 documents the canonical CUDA environment problem. The recommended fix published in the issue thread is:
# Option 1: install the matching cuBLAS and point LD_LIBRARY_PATH at it
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
# Option 2: use the official CUDA 12 Docker image
# Option 3: pin PyTorch and all CUDA-dependent libs to the CUDA 11.8 build
Users should verify their CUDA toolkit, cuBLAS (>=12.4.5.8), and cuDNN (>=9.00) versions before reporting new issues.
Parallel Inference Failures
Issue #249 reports a failure when launching multi-GPU inference with:
torchrun --nproc_per_node=2 sample_video.py \
--video-size 1280 720 --video-length 129 --infer-steps 50 \
--prompt "astronaut is fixing the space station." \
--flow-reverse --seed 42 --ulysses-degree 2 --ring-degree 1
The Ulysses + Ring sequence-parallel configuration requires the distributed launcher settings (such as --nproc_per_node) to match the degree flags (--ulysses-degree, --ring-degree). When these are mismatched, or when the distributed backend is not set up (for example, the environment variables RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are missing), the process group fails to initialize. The shipped scheduler (FlowMatchDiscreteScheduler.step) operates on per-rank tensors and assumes the parallel attention has already produced a full-sequence view, so initialization must succeed before the first step() call.
Prompt Template Truncation
The PROMPT_TEMPLATE dictionary in hyvideo/constants.py uses two crop_start values — 36 for the image encoder and 95 for the video encoder. If a user passes a custom prompt template, they must recompute crop_start to match the prefix length of the system prompt; otherwise the tokenized input will include system-prompt tokens that bias the generation away from the user's intent. The truncation is applied inside the text encoder wrapper's encode method.
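The relationship between a template's system-prompt prefix and crop_start can be sketched with a toy tokenizer. Everything here is illustrative: template_prefix_length and crop_prompt_tokens are hypothetical helpers, the template string is invented, and whitespace splitting stands in for the real tokenizer. With a real tokenizer the prefix length should be measured on the fully formatted prompt, since subword tokenization of a prefix is not guaranteed to match.

```python
from typing import Callable, List


def template_prefix_length(tokenize: Callable[[str], List[str]],
                           template: str, placeholder: str = "{}") -> int:
    """Count the tokens that precede the user prompt in a template."""
    prefix = template.split(placeholder)[0]
    return len(tokenize(prefix))


def crop_prompt_tokens(tokens: List[str], crop_start: int) -> List[str]:
    """Drop the first crop_start tokens so only the user prompt remains."""
    if crop_start < 0 or crop_start > len(tokens):
        raise ValueError("crop_start is outside the token sequence")
    return tokens[crop_start:]


# Toy template and tokenizer (assumptions, not the repository's constants).
template = "SYSTEM: describe the video in detail. USER: {}"
tokenize = str.split

crop_start = template_prefix_length(tokenize, template)
full_tokens = tokenize(template.format("A cat walks on the grass"))
user_tokens = crop_prompt_tokens(full_tokens, crop_start)

print(crop_start)
print(user_tokens)
```

If crop_start were left at a value computed for a different template, the slice would either keep system-prompt tokens or cut into the user prompt, which is the failure mode described above.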
Scheduler Step-Index State
FlowMatchDiscreteScheduler.step enforces a contract that integer indices are not accepted; it expects torch.FloatTensor values taken from scheduler.timesteps. Passing the integer index produced by enumerate(timesteps), rather than the timestep value itself, raises ValueError per the explicit guard in scheduling_flow_match_discrete.py. Users porting code from diffusers' EulerDiscreteScheduler often trip this guard.
Licensing Constraints Affecting Deployment
The Tencent Hunyuan Community License Agreement (LICENSE.txt) imposes several restrictions that users should understand before deploying:
flowchart LR
A[Use HunyuanVideo] --> B{Territory}
B -- EU/UK/South Korea --> X[Not Licensed]
B -- Worldwide ex. above --> C{Check Section 5 Rules}
C --> D[No impersonation]
C --> E[No high-stakes automation]
C --> F[No military use]
C --> G[No discrimination]
C --> H[No violence/terrorism]
D & E & F & G & H --> I[Compliant Deployment]
A --> J{MAU > 100M?}
J -- Yes --> K[Must request commercial license]
J -- No --> I
Additional commercial trigger (LICENSE.txt): "If, on the Tencent Hunyuan version release date, the monthly active users of all products or services made available by or for Licensee is greater than 100 million monthly active users in the preceding calendar month, You must request a license from Tencent." Any redistribution must include the attribution string beginning with "Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright © 2024 Tencent."
See Also
- Architecture overview and DiT block definitions: hyvideo/modules/models.py
- Denoising loop and CFG behavior: hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py
- Scheduler semantics and step contract: hyvideo/diffusion/schedulers/scheduling_flow_match_discrete.py
- Prompt templates, negative prompt, and precision enums: hyvideo/constants.py
- 3D causal VAE used for latent encoding/decoding: hyvideo/vae/autoencoder_kl_causal_3d.py
- Causal UNet residual/attention blocks: hyvideo/vae/unet_causal_3d_blocks.py
- Text encoder wrapper and tokenizer integration: hyvideo/text_encoder/__init__.py
- Checkpoint layout and download instructions: ckpts/README.md
Source: https://github.com/Tencent-Hunyuan/HunyuanVideo / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 12 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/249
2. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/302
3. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: identity.distribution | https://github.com/Tencent-Hunyuan/HunyuanVideo
4. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/317
5. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/Tencent-Hunyuan/HunyuanVideo
6. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/311
7. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/Tencent-Hunyuan/HunyuanVideo/issues/313
8. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/Tencent-Hunyuan/HunyuanVideo
9. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/Tencent-Hunyuan/HunyuanVideo
10. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/Tencent-Hunyuan/HunyuanVideo
11. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/Tencent-Hunyuan/HunyuanVideo
12. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/Tencent-Hunyuan/HunyuanVideo
Source: Doramagic discovery, validation, and Project Pack records