ZheWang123 committed
Commit 43ffe59 · verified · 1 Parent(s): 00b241c

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +325 -0
  2. dw-ll_ucoco_384.onnx +3 -0
  3. yolox_l.onnx +3 -0
README.md ADDED
@@ -0,0 +1,325 @@
# UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

This repo contains checkpoints for UniAnimate-DiT:

- UniAnimate-Wan2.1-14B-Lora-12000.ckpt: the weights of the LoRA and additional learnable modules after 12000 training steps.

- dw-ll_ucoco_384.onnx: DWPose model used for pose extraction.

- yolox_l.onnx: YOLOX detection model used for pose extraction.


# UniAnimate-DiT

An expanded version of [UniAnimate](https://arxiv.org/abs/2406.01188) built on [Wan2.1](https://github.com/Wan-Video/Wan2.1).

UniAnimate-DiT builds on the state-of-the-art DiT-based Wan2.1-14B-I2V model for consistent human image animation. Wan2.1 is a collection of video synthesis models open-sourced by Alibaba. Our code is based on [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio); thanks for this nice open-source project.

<table>
<center>
<tr>
    <td><center>
    <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/wuGhP_lx9FMOZcibpgPg3HAFQXRt2cPSQPMkG2mxQGw.mp4" muted="false"></video>
    </center></td>
    <td><center>
    <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/K4lR1PGHnWYc7X6-e8dkZ_60TCr0_6Gqjc3rT6CW7Xw.mp4" muted="false"></video>
    </center></td>
</tr>
</center>
</table>

## Getting Started with UniAnimate-DiT

### (1) Installation

Before using this model, please create the conda environment and install DiffSynth-Studio from **source code**:

```shell
conda create -n UniAnimate-Wan python=3.9.21
conda activate UniAnimate-Wan

# CUDA 11.8
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
# CUDA 12.4
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124

git clone https://github.com/ali-vilab/UniAnimate-DiT.git
cd UniAnimate-DiT
pip install -e .
```

UniAnimate-DiT supports multiple attention implementations. If you have installed any of the following, it will be enabled according to this priority order (a quick way to check which backends are available in your environment is sketched after the list):

* [Flash Attention 3](https://github.com/Dao-AILab/flash-attention)
* [Flash Attention 2](https://github.com/Dao-AILab/flash-attention)
* [Sage Attention](https://github.com/thu-ml/SageAttention)
* [torch SDPA](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) (default; `torch>=2.5.0` is recommended)

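For a quick check of which of these optional backends are importable in your environment, here is a minimal sketch (not part of the repo; it assumes the usual import names `flash_attn` and `sageattention` for the corresponding packages):

```python
# Sketch: report which optional attention backends are importable.
# "flash_attn" and "sageattention" are the assumed import names of the
# corresponding packages; torch SDPA ships with PyTorch itself.
import importlib.util

import torch

for label, module_name in [
    ("Flash Attention (2/3)", "flash_attn"),
    ("Sage Attention", "sageattention"),
]:
    installed = importlib.util.find_spec(module_name) is not None
    print(f"{label:22s}: {'installed' if installed else 'not installed'}")

has_sdpa = hasattr(torch.nn.functional, "scaled_dot_product_attention")
print(f"{'torch SDPA':22s}: {'available' if has_sdpa else 'missing'} (torch {torch.__version__})")
```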

## Inference

### (2) Download the pretrained checkpoints

Download the Wan2.1-I2V-14B-720P models using huggingface-cli:
```
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./Wan2.1-I2V-14B-720P
```

Or download them using modelscope-cli:
```
pip install modelscope
modelscope download Wan-AI/Wan2.1-I2V-14B-720P --local_dir ./Wan2.1-I2V-14B-720P
```

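If you prefer scripting the download from Python rather than the CLI, here is a minimal sketch using `huggingface_hub.snapshot_download` with the same repository ID and target directory as above:

```python
# Sketch: download the Wan2.1-I2V-14B-720P weights via the huggingface_hub API
# instead of the CLI; requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.1-I2V-14B-720P",
    local_dir="./Wan2.1-I2V-14B-720P",  # same layout as the CLI command above
)
```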

Download the pretrained UniAnimate-DiT models (these only include the weights of the LoRA and the additional learnable modules):
```
pip install modelscope
modelscope download xiaolaowx/UniAnimate-DiT --local_dir ./checkpoints
```

Finally, the model weights will be organized in `./checkpoints/` as follows:
```
./checkpoints/
|---- dw-ll_ucoco_384.onnx
|---- UniAnimate-Wan2.1-14B-Lora-12000.ckpt
└---- yolox_l.onnx
```

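As a quick sanity check (a sketch, not part of the repo), you can verify that the three expected files are present before moving on:

```python
# Sketch: verify that the downloaded checkpoint files exist and report their sizes.
from pathlib import Path

checkpoint_dir = Path("./checkpoints")
expected = [
    "dw-ll_ucoco_384.onnx",
    "UniAnimate-Wan2.1-14B-Lora-12000.ckpt",
    "yolox_l.onnx",
]

for name in expected:
    path = checkpoint_dir / name
    if path.is_file():
        print(f"OK      {name} ({path.stat().st_size / 1e6:.1f} MB)")
    else:
        print(f"MISSING {name}")
```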

### (3) Pose alignment

Rescale the target pose sequence to match the pose of the reference image (you can also `pip install onnxruntime-gpu==1.18.1` for faster extraction on GPU):
```
# reference image 1
python run_align_pose.py --ref_name data/images/WOMEN-Blouses_Shirts-id_00004955-01_4_full.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/WOMEN-Blouses_Shirts-id_00004955-01_4_full

# reference image 2
python run_align_pose.py --ref_name data/images/musk.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/musk

# reference image 3
python run_align_pose.py --ref_name data/images/WOMEN-Blouses_Shirts-id_00005125-03_4_full.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/WOMEN-Blouses_Shirts-id_00005125-03_4_full

# reference image 4
python run_align_pose.py --ref_name data/images/IMG_20240514_104337.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/IMG_20240514_104337

# reference image 5
python run_align_pose.py --ref_name data/images/10.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/10
```
The processed target poses for the demo videos will be saved in `data/saved_pose`. `--ref_name` is the path of the reference image, `--source_video_paths` provides the source pose video, and `--saved_pose_dir` is the output directory for the processed target poses.

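To align several reference images against the same source video without retyping the command, here is a small batching sketch (it only uses the flags documented above; the listed images are the demo assets):

```python
# Sketch: run run_align_pose.py for several reference images in a loop.
import subprocess
from pathlib import Path

source_video = "data/videos/source_video.mp4"
ref_images = [
    "data/images/musk.jpg",
    "data/images/10.jpg",
    # add more reference images here
]

for ref in ref_images:
    out_dir = Path("data/saved_pose") / Path(ref).stem  # e.g. data/saved_pose/musk
    subprocess.run(
        [
            "python", "run_align_pose.py",
            "--ref_name", ref,
            "--source_video_paths", source_video,
            "--saved_pose_dir", str(out_dir),
        ],
        check=True,
    )
```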
122
+ ### (4) Run UniAnimate-Wan2.1-14B-I2V to generate 480P videos
123
+
124
+ ```
125
+ CUDA_VISIBLE_DEVICES="0" python examples/unianimate_wan/inference_unianimate_wan_480p.py
126
+ ```
127
+ About 23G GPU memory is needed. After this, 81-frame video clips with 832x480 (hight x width) resolution will be generated under the `./outputs` folder:
128
+
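Since the 480P and 720P settings need roughly 23 GB and 36 GB of GPU memory respectively, a quick pre-flight check of the visible GPU (a sketch, not part of the repo) can save a failed run:

```python
# Sketch: report total memory of the GPU selected by CUDA_VISIBLE_DEVICES.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible; inference requires a GPU.")

props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.1f} GB total memory")
print("480P needs ~23 GB, 720P needs ~36 GB (see the memory notes in this section).")
```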
129
+ For long video generation, run the following comment:
130
+
131
+ ```
132
+ CUDA_VISIBLE_DEVICES="0" python examples/unianimate_wan/inference_unianimate_wan_long_video_480p.py
133
+ ```
134
+
135
+ ### (5) Run UniAnimate-Wan2.1-14B-I2V to generate 720P videos
136
+
137
+ ```
138
+ CUDA_VISIBLE_DEVICES="0" python examples/unianimate_wan/inference_unianimate_wan_720p.py
139
+ ```
140
+ About 36G GPU memory is needed. After this, 81-frame video clips with 1280x720 resolution will be generated:
141
+
142
+
143
+ Note: Even though our model was trained on 832x480 resolution, we observed that direct inference on 1280x720 is usually allowed and produces satisfactory results.
144
+
145
+
146
+ For long video generation, run the following comment:
147
+
148
+ ```
149
+ CUDA_VISIBLE_DEVICES="0" python examples/unianimate_wan/inference_unianimate_wan_long_video_720p.py
150
+ ```
151
+
152
+ ## Train
153
+
154
+ We support UniAnimate-DiT training on our own dataset.
155
+
156
+ ### Step 1: Install additional packages
157
+
158
+ ```
159
+ pip install peft lightning pandas
160
+ # deepspeed for multiple GPUs
161
+ pip install -U deepspeed
162
+ ```
163
+
164
+ ### Step 2: Prepare your dataset
165
+
166
+ In order to speed up the training, we preprocessed the videos, extracted video frames and corresponding Dwpose in advance, and packaged them with pickle package. You need to manage the training data as follows:
167
+
168
+ ```
169
+ data/example_dataset/
170
+ └── TikTok
171
+ └── 00001_mp4
172
+ β”œβ”€β”€ dw_pose_with_foot_wo_face.pkl # packaged Dwpose
173
+ └── frame_data.pkl # packaged frames
174
+ ```
175
+
176
+ We encourage adding large amounts of data to finetune models to get better results. The experimental results show that about 1000 training videos can finetune a good human image animation model.
177
+
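To double-check a packaged sample before training, here is a minimal inspection sketch (it assumes the `.pkl` files hold ordinary picklable objects; their exact structure is defined by the preprocessing script, so the printed summary is only indicative):

```python
# Sketch: peek inside one packaged training sample.
import pickle
from pathlib import Path

sample_dir = Path("data/example_dataset/TikTok/00001_mp4")

for pkl_name in ["frame_data.pkl", "dw_pose_with_foot_wo_face.pkl"]:
    with open(sample_dir / pkl_name, "rb") as f:
        obj = pickle.load(f)
    length = len(obj) if hasattr(obj, "__len__") else "n/a"
    print(f"{pkl_name}: type={type(obj).__name__}, len={length}")
```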
178
+ ### Step 3: Train
179
+
180
+ For convenience, we do not pre-process VAE features, but put VAE pre-processing and DiT model training in a training script, and also facilitate data augmentation to improve performance. You can also choose to extract VAE features first and then conduct subsequent DiT model training.
181
+
182
+
183
+ LoRA training (One A100 GPU):
184
+
185
+ ```shell
186
+ CUDA_VISIBLE_DEVICES="0" python examples/unianimate_wan/train_unianimate_wan.py \
187
+ --task train \
188
+ --train_architecture lora \
189
+ --lora_rank 64 --lora_alpha 64 \
190
+ --dataset_path data/example_dataset \
191
+ --output_path ./models_out_one_GPU \
192
+ --dit_path "/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00001-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00002-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00003-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00004-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00005-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00006-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00007-of-00007.safetensors" \
193
+ --max_epochs 10 --learning_rate 1e-4 \
194
+ --accumulate_grad_batches 1 \
195
+ --use_gradient_checkpointing --image_encoder_path "/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth" --use_gradient_checkpointing_offload
196
+ ```
197
+
198
+
199
+ LoRA training (Multi-GPUs, based on `Deepseed`):
200
+
201
+ ```shell
202
+ CUDA_VISIBLE_DEVICES="0,1,2,3" python examples/unianimate_wan/train_unianimate_wan.py \
203
+ --task train --train_architecture lora \
204
+ --lora_rank 128 --lora_alpha 128 \
205
+ --dataset_path data/example_dataset \
206
+ --output_path ./models_out --dit_path "/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00001-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00002-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00003-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00004-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00005-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00006-of-00007.safetensors,/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00007-of-00007.safetensors" \
207
+ --max_epochs 10 --learning_rate 1e-4 \
208
+ --accumulate_grad_batches 1 \
209
+ --use_gradient_checkpointing \
210
+ --image_encoder_path "/mnt/user/VideoGeneration_Baselines/Wan2.1/Wan2.1-I2V-14B-720P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth" \
211
+ --use_gradient_checkpointing_offload \
212
+ --training_strategy "deepspeed_stage_2"
213
+ ```
214
+
215
+
216
+ You can also finetune our trained model by set `--pretrained_lora_path="./checkpoints/UniAnimate-Wan2.1-14B-Lora.ckpt"`.
217
+
218
+ ### Step 4: Test
219
+
220
+ Test the LoRA finetuned model trained on one GPU:
221
+
222
+ ```python
223
+ import torch
224
+ from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData, WanUniAnimateVideoPipeline
225
+
226
+
227
+ # Load models
228
+ model_manager = ModelManager(device="cpu")
229
+ model_manager.load_models(
230
+ ["Wan2.1/Wan2.1-I2V-14B-720P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth"],
231
+ torch_dtype=torch.float32, # Image Encoder is loaded with float32
232
+ )
233
+ model_manager.load_models(
234
+ [
235
+ [
236
+
237
+ "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00001-of-00007.safetensors",
238
+ "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00002-of-00007.safetensors",
239
+ "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00003-of-00007.safetensors",
240
+ "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00004-of-00007.safetensors",
241
+ "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00005-of-00007.safetensors",
242
+ "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00006-of-00007.safetensors",
243
+ "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00007-of-00007.safetensors",
244
+
245
+ ],
246
+ "Wan2.1/Wan2.1-I2V-14B-720P/models_t5_umt5-xxl-enc-bf16.pth",
247
+ "Wan2.1/Wan2.1-I2V-14B-720P/Wan2.1_VAE.pth",
248
+ ],
249
+ torch_dtype=torch.bfloat16,
250
+ )
251
+
252
+ model_manager.load_lora_v2("models/lightning_logs/version_1/checkpoints/epoch=0-step=500.ckpt", lora_alpha=1.0)
253
+
254
+ ...
255
+ ...
256
+ ```
To test the LoRA finetuned model trained on multiple GPUs with DeepSpeed, first run `python zero_to_fp32.py . output_dir/ --safe_serialization` to convert the .pt files to .safetensors files, and then run:

```python
import torch
from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData, WanUniAnimateVideoPipeline


# Load models
model_manager = ModelManager(device="cpu")
model_manager.load_models(
    ["Wan2.1/Wan2.1-I2V-14B-720P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth"],
    torch_dtype=torch.float32,  # Image Encoder is loaded with float32
)
model_manager.load_models(
    [
        [
            "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00001-of-00007.safetensors",
            "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00002-of-00007.safetensors",
            "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00003-of-00007.safetensors",
            "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00004-of-00007.safetensors",
            "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00005-of-00007.safetensors",
            "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00006-of-00007.safetensors",
            "Wan2.1/Wan2.1-I2V-14B-720P/diffusion_pytorch_model-00007-of-00007.safetensors",
        ],
        "Wan2.1/Wan2.1-I2V-14B-720P/models_t5_umt5-xxl-enc-bf16.pth",
        "Wan2.1/Wan2.1-I2V-14B-720P/Wan2.1_VAE.pth",
    ],
    torch_dtype=torch.bfloat16,
)

model_manager.load_lora_v2([
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00001-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00002-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00003-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00004-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00005-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00006-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00007-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00008-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00009-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00010-of-00011.safetensors",
    "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir/model-00011-of-00011.safetensors",
], lora_alpha=1.0)

...
...
```

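Instead of hard-coding all eleven converted shards, the `load_lora_v2` call from the block above can be fed a globbed list (a sketch that continues from the previous block; the directory is the example path shown there):

```python
# Sketch: collect the converted LoRA shards with a glob instead of listing them by hand.
import glob

lora_dir = "./models/lightning_logs/version_0/checkpoints/epoch=0-step=500.ckpt/output_dir"
lora_shards = sorted(glob.glob(f"{lora_dir}/model-*-of-*.safetensors"))

# model_manager is the ModelManager instance built in the block above.
model_manager.load_lora_v2(lora_shards, lora_alpha=1.0)
```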

## Citation

If you find this codebase useful for your research, please cite the following paper:

```
@article{wang2025unianimate,
  title={UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation},
  author={Wang, Xiang and Zhang, Shiwei and Gao, Changxin and Wang, Jiayu and Zhou, Xiaoqiang and Zhang, Yingya and Yan, Luxin and Sang, Nong},
  journal={Science China Information Sciences},
  year={2025}
}
```


## Disclaimer

This project is intended for academic research, and we explicitly disclaim any responsibility for user-generated content. Users are solely liable for their actions while using the generative model. The project contributors have no legal affiliation with, nor accountability for, users' behaviors. It is imperative to use the generative model responsibly, adhering to both ethical and legal standards.
dw-ll_ucoco_384.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:724f4ff2439ed61afb86fb8a1951ec39c6220682803b4a8bd4f598cd913b1843
size 134399116
yolox_l.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7860ae79de6c89a3c1eb72ae9a2756c0ccfbe04b7791bb5880afabd97855a411
size 216746733