HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Liyang Chen*, Tianxiang Ma*, Jiawei Liu, Bingchuan Li†, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu§
* Equal contribution, † Project lead, § Corresponding author
Tsinghua University | Intelligent Creation Team, ByteDance
🔥 Latest News
- A Best-Practice Guide for HuMo will be released soon. Stay tuned.
- Sep 16, 2025: 🔥🔥 We release the 1.7B weights, which generate a 480P video in 8 minutes on a 32 GB GPU. The visual quality is lower than that of the 17B model, but audio-visual sync remains nearly unaffected.
- Sep 13, 2025: 🔥🔥 The 17B model is merged into ComfyUI-Wan. Thanks to kijai for the update!
- Sep 10, 2025: 🔥🔥 We release the 17B weights and inference code.
- Sep 9, 2025: We release the project page and technical report of HuMo.
✨ Key Features
HuMo is a unified, human-centric video generation framework designed to produce high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. It supports strong text-prompt following, consistent subject preservation, and synchronized audio-driven motion.
- **VideoGen from Text-Image** - Customize character appearance, clothing, makeup, props, and scenes using text prompts combined with reference images.
- **VideoGen from Text-Audio** - Generate audio-synchronized videos solely from text and audio inputs, removing the need for image references and enabling greater creative freedom.
- **VideoGen from Text-Image-Audio** - Achieve the highest level of customization and control by combining text, image, and audio guidance.
📋 Todo List
- Release Paper
- Checkpoint of HuMo-17B
- Checkpoint of HuMo-1.7B
- Inference Code
- Text-Image Input
- Text-Audio Input
- Text-Image-Audio Input
- Multi-GPU Inference
- Best-Practice Guide for HuMo
- Prompts to Generate Demo of Faceless Thrones
- Training Data
⚡️ Quickstart
Installation
conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
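As an optional sanity check (not part of the official setup), you can confirm that the pinned PyTorch CUDA build, FlashAttention, and ffmpeg all load correctly in the new environment:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"   # expect 2.5.1 (+cu124), 12.4, True
python -c "import flash_attn; print(flash_attn.__version__)"   # expect 2.6.3
ffmpeg -version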
Model Preparation
| Models | Download Link | Notes |
|---|---|---|
| HuMo-17B | 🤗 Huggingface | Supports 480P & 720P |
| HuMo-1.7B | 🤗 Huggingface | Lightweight; runs on a 32 GB GPU |
| Wan-2.1 | 🤗 Huggingface | VAE & text encoder |
| Whisper-large-v3 | 🤗 Huggingface | Audio encoder |
| Audio separator | 🤗 Huggingface | Removes background noise (optional) |
Download models using huggingface-cli:
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
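After the downloads complete, the four model folders should sit under ./weights/, matching the --local-dir paths above (adjust if you downloaded elsewhere):
ls ./weights   # expect: HuMo  Wan2.1-T2V-1.3B  audio_separator  whisper-large-v3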
Run Multimodal-Condition-to-Video Generation
Our model supports both 480P and 720P resolutions; 720P inference produces noticeably better visual quality.
Some tips:
- Please prepare your text, reference images, and audio as described in test_case.json (a hypothetical sketch follows this list).
- We support multi-GPU inference using FSDP + Sequence Parallel.
- The model is trained on 97-frame videos at 25 FPS. Generating videos longer than 97 frames may degrade performance. We will provide a new checkpoint for longer generation.
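For orientation only, a test case typically pairs a text prompt with optional reference-image and audio paths. The authoritative schema is whatever test_case.json in this repo defines; the file name and field names below (my_case, text, img_paths, audio_path) are hypothetical placeholders, not the official format:
cat > my_test_case.json <<'EOF'
{
  "my_case": {
    "text": "A woman in a red coat sings on a rooftop at dusk.",
    "img_paths": ["./examples/ref_face.png"],
    "audio_path": "./examples/singing.wav"
  }
}
EOF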
Configure HuMo
HuMo's behavior and output can be customized by editing the generate.yaml configuration file.
The following parameters control generation length, video resolution, and how text, image, and audio inputs are balanced:
generation:
  frames: <int>      # Number of frames in the generated video.
  scale_a: <float>   # Strength of audio guidance. Higher = better audio-motion sync.
  scale_t: <float>   # Strength of text guidance. Higher = better adherence to the text prompt.
  mode: "TA"         # Input mode: "TA" for text + audio; "TIA" for text + image + audio.
  height: 720        # Video height (e.g., 720 or 480).
  width: 1280        # Video width (e.g., 1280 or 832).
dit:
  sp_size: <int>     # Sequence parallelism size. Set this equal to the number of GPUs used.
diffusion:
  timesteps:
    sampling:
      steps: 50      # Number of denoising steps. Lower (30-40) = faster generation.
1. Text-Audio Input
bash scripts/infer_ta.sh # infer with 17B model
bash scripts/infer_ta_1_7B.sh # infer with 1.7B model
2. Text-Image-Audio Input
bash scripts/infer_tia.sh # infer with 17B model
bash scripts/infer_tia_1_7B.sh # infer with 1.7B model
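For a multi-GPU run, the sketch below assumes 4 GPUs and that dit.sp_size has already been set to 4 in generate.yaml; restricting visible devices with CUDA_VISIBLE_DEVICES is a generic PyTorch convention, not a HuMo-specific flag:
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/infer_tia.sh   # 17B model, text + image + audio, 4-way sequence parallel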
Acknowledgements
Our work builds upon and is greatly inspired by several outstanding open-source projects, including Phantom, SeedVR, MEMO, Hallo3, OpenHumanVid, OpenS2V-Nexus, ConsisID, and Whisper. We sincerely thank the authors and contributors of these projects for generously sharing their excellent code and ideas.
⭐ Citation
If HuMo is helpful, please help to ⭐ the repo.
If you find this project useful for your research, please consider citing our paper.
BibTeX
@misc{chen2025humo,
  title={HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning},
  author={Liyang Chen and Tianxiang Ma and Jiawei Liu and Bingchuan Li and Zhuowei Chen and Lijie Liu and Xu He and Gen Li and Qian He and Zhiyong Wu},
  year={2025},
  eprint={2509.08519},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.08519},
}
📧 Contact
If you have any comments or questions regarding this open-source project, please open a new issue or contact Liyang Chen and Tianxiang Ma.