ARC-Qwen-Video-7B
In this version, we switch the base model from the Hunyuan VLM used in ARC-Hunyuan-Video-7B to Qwen2.5-VL-7B-Instruct and introduce ARC-Qwen-Video-7B for understanding real-world short videos. We use the same training data and training stages; for a detailed introduction, please refer to ARC-Hunyuan-Video-7B. The main distinctions are listed below:
| Feature | ARC-Hunyuan-Video-7B | ARC-Qwen-Video-7B |
|---|---|---|
| Base VLM | Hunyuan-VL-7B-Pretrain | Qwen2.5-VL-7B-Instruct |
| Frame Resolution (each model uses a fixed resolution to maintain audio-video synchronization) | Fixed at 640 x 640 | Fixed at 392 x 292 |
| Frame Sampling | • < 150s: 1 FPS<br>• > 150s: uniformly sample 150 frames | • < 300s: 1 FPS<br>• > 300s: uniformly sample 300 frames |
| Audio-Video Synchronization | • < 150s: sum tokens from 1s audio + 1s video frame<br>• 150-300s: sum tokens from the corresponding audio segment + video frame<br>• > 300s: split audio into 300 segments, use the first 2s of each | • < 300s: sum tokens from 1s audio + 1s video frame<br>• > 300s: split audio into 300 segments, use the middle 1s of each |
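As a concrete illustration of the sampling rule above, the following minimal sketch (our own helper, not part of the released code) reproduces the ARC-Qwen-Video-7B behaviour given a known video duration:

```python
# Illustrative sketch of the ARC-Qwen-Video-7B frame-sampling rule above.
# `sample_frame_timestamps` is our own name, not part of the released code.
import numpy as np

def sample_frame_timestamps(duration_s: float, max_frames: int = 300) -> np.ndarray:
    """Return frame timestamps in seconds: 1 FPS for videos up to `max_frames`
    seconds, otherwise `max_frames` uniformly spaced frames."""
    if duration_s <= max_frames:
        return np.arange(int(duration_s), dtype=np.float32)                  # 1 FPS
    return np.linspace(0.0, duration_s, num=max_frames, endpoint=False)      # uniform

print(sample_frame_timestamps(120.0).shape)  # (120,)
print(sample_frame_timestamps(900.0).shape)  # (300,)
```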
We also introduce a new model, ARC-Qwen-Video-7B-Narrator. It outputs timestamped video descriptions, speaker identities, and the corresponding ASR (Automatic Speech Recognition) content. By processing its output with an external LLM, you can obtain more comprehensive structured information.
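One hypothetical way to do that post-processing is to wrap the Narrator's raw output in a prompt for whichever external LLM you use; the prompt wording and the call_llm helper below are placeholders, not part of this repository:

```python
# Hypothetical post-processing of ARC-Qwen-Video-7B-Narrator output with an external LLM.
# `call_llm` is a placeholder for whatever LLM API you use; the prompt wording is ours.
def build_structuring_prompt(narrator_output: str) -> str:
    return (
        "Below is a timestamped narration of a short video, including speaker "
        "identities and ASR content. Convert it into JSON with fields: "
        "chapters (start, end, title), speakers, and key_events.\n\n" + narrator_output
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM API here")

# structured_json = call_llm(build_structuring_prompt(narrator_text))
```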
Usage
Dependencies
The installation has been tested and verified on the following environments:
- NVIDIA H20 with CUDA 12.4
- NVIDIA A100 with CUDA 12.1
Installation
Clone the repo and install the dependent packages:
git clone -b arc-qwen-video https://github.com/TencentARC/ARC-Hunyuan-Video-7B.git
cd ARC-Hunyuan-Video-7B
# Install torch 2.6.0 based on your CUDA version
# CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# CUDA 12.6
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install librosa decord av accelerate
pip uninstall transformers
pip install git+https://github.com/geyuying/transformers.git@arc-qwen-video
pip install flash_attn==2.7.1.post4
# Install FFmpeg according to your system, and ensure that the following command prints version information:
ffmpeg -version
# (Optional) To use vllm for inference, follow the instructions below:
pip uninstall vllm
pip install git+https://github.com/geyuying/vllm.git@arc-qwen-video
An 'Ugly' Workaround for vLLM Installation
If you are unable to install our provided vllm package, we offer an alternative "ugly" method:
1. Install vllm with Qwen2.5-VL support.
2. Modify config.json. In your model weights directory, open config.json and change the architectures field to "Qwen2_5_VLForConditionalGeneration" (a minimal sketch of this edit is shown below, after the note on why this works).
3. Patch the vllm source code. Locate the file vllm/model_executor/models/qwen2_5_vl.py in your vllm installation path and add the following code inside the __init__ method of the Qwen2_5_VLForConditionalGeneration class:
# Requires `from transformers import WhisperModel` at the top of qwen2_5_vl.py
# (torch.nn is typically already imported there as `nn`).
# Load the Whisper encoder that serves as the audio encoder.
whisper_path = 'openai/whisper-large-v3'
speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder
self.speech_encoder = speech_encoder

# Project audio features into the LLM hidden space with a small MLP.
speech_dim = speech_encoder.config.d_model
llm_hidden_size = config.vision_config.out_hidden_size
self.mlp_speech = nn.Sequential(
    nn.LayerNorm(speech_dim),
    nn.Linear(speech_dim, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size)
)
Why this works: Our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP projector. During vllm inference, the multi-modal encoders process inputs sequentially, while the LLM performs batched inference. Since we only need to pass the final multi-modal embeddings to the LLM, we can reuse the existing Qwen2.5-VL code.
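For step 2 of the workaround, the config.json edit can be applied with a few lines of Python; a minimal sketch, where the weights directory path is a placeholder:

```python
# Minimal sketch for step 2 above: rewrite the `architectures` field of config.json.
# Replace the placeholder path with your local ARC-Qwen-Video-7B weights directory.
import json
from pathlib import Path

cfg_path = Path("/path/to/ARC-Qwen-Video-7B") / "config.json"  # placeholder path
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = ["Qwen2_5_VLForConditionalGeneration"]  # stored as a list in HF configs
cfg_path.write_text(json.dumps(cfg, indent=2))
```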
Inference
# Our model currently excels at processing short videos of up to 5 minutes.
# If your video is longer, we recommend following the approach used in our demo and API:
# split the video into segments for inference, and then use an LLM to integrate the results.
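As a sketch of that segmentation step (the helper name and the 300-second chunk length, chosen to match the model's sampling limit, are our own choices):

```python
# Illustrative segmentation of a long video into chunks of at most 300 seconds.
# Each (start, end) segment would be run through the model separately, and the
# per-segment outputs merged afterwards with an external LLM.
def split_into_segments(duration_s: float, max_len_s: float = 300.0):
    segments, start = [], 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

print(split_into_segments(760.0))  # [(0.0, 300.0), (300.0, 600.0), (600.0, 760.0)]
```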
To quickly verify that your environment is set up correctly and that video and audio information are being processed as expected, you can run the following test case with ARC-Qwen-Video-7B.
video_path = "examples/猪排.mp4"
task = "QA"
question = "What did the man say at the beginning of the video after measuring the thickness of the fried pork cutlet?"
Expected Result: If the model's output contains the phrase "So thin", it indicates that your installation is successful.
Inference without vllm
cd ARC-Hunyuan-Video-7B
# For ARC-Qwen-Video-7B
python3 inference_arc_qwen_video.py
# For ARC-Qwen-Video-7B-Narrator
python3 inference_arc_qwen_video_narrator.py
Inference with vllm
cd ARC-Hunyuan-Video-7B
# For ARC-Qwen-Video-7B
python3 vllm_arc_qwen_vl_video_batch.py --batch_inference
# For ARC-Qwen-Video-7B-Narrator
python3 vllm_arc_qwen_vl_video_batch_narrator.py --batch_inference
Benchmark Performance
| Model | Video-MMMU | MMVU | Temp-Compass | Video-Holmes | Video-MME | VCR-Bench | MV-Bench | ShortVid-Bench | Charades-STA |
|---|---|---|---|---|---|---|---|---|---|
| ARC-Hunyuan-Video-7B | 31.1 | 49.1 | 66.0 | 40.9 | 58.7 | 50.5 | 62.6 | 73.0 | 54.8 |
| ARC-Qwen-Video-7B | 41.3 | 55.5 | 68.7 | 51.1 | 61.0 | 52.3 | 60.8 | 72.6 | 52.8 |
Quantitative evaluation is performed on the different benchmarks using accuracy as the evaluation metric, except for the grounding task on Charades-STA, which uses mIoU. For all benchmarks other than Video-MMMU and Charades-STA, we evaluated only the multiple-choice questions.
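For reference, the Charades-STA score is the temporal IoU between the predicted and ground-truth segments, averaged over samples; a minimal sketch of the per-sample computation (our own helper):

```python
# Temporal IoU between a predicted and a ground-truth segment, in seconds.
# mIoU on Charades-STA is this value averaged over all test samples.
def temporal_iou(pred: tuple, gt: tuple) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```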
Citation
If you find the work helpful, please consider citing:
@article{ge2025arc,
title={ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts},
author={Ge, Yuying and Ge, Yixiao and Li, Chen and Wang, Teng and Pu, Junfu and Li, Yizhuo and Qiu, Lu and Ma, Jin and Duan, Lisheng and Zuo, Xinyu and others},
journal={arXiv preprint arXiv:2507.20939},
year={2025}
}