hunyuan-video-i2v-720p-Q8_0.gguf runs out of memory, but Kijai's doesn't
I'm using the hunyuan-video-i2v-720p-Q8_0.gguf model from this repository with the "V2" guidance type via the HunyuanImageToVideo node in ComfyUI. However, this model causes an OOM when it gets to the sampling step.
But, if I use Kijai's GGUF model from here, I don't get any OOM:
https://huggingface.co/Kijai/HunyuanVideo_comfy/blob/main/hunyuan_video_I2V-Q8_0.gguf
With Kijai's model, I can generate the full 129 frames and come in just under 24GB of VRAM on my 4090 (96%) at 720x720 with 20 steps. It must be the old V1 model, since it uses the V1 guidance type, but I seem to be getting good results from it.
Can you think of any reason why this is?
Hmm, not sure. It's indeed the original V1 model, though I'm unsure why the V2 would use more VRAM unless the different guidance algo is somehow more taxing.
The only two real solutions I can think of are offloading the text encoder to the CPU to make sure it's not memory management weirdness, or testing with ComfyUI-MultiGPU (despite the name, it works for single-GPU setups as well, and seems to allow manually offloading parts of the model instead of relying on the auto logic).
That, or trying a lower quant; at least based on this quick test, the lower quants seem mostly fine, though Q8_0 is definitely the closest to the original.
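If you want to see what's actually resident when it OOMs, something like this (plain torch calls; drop it into a small custom node or call it right before/after the sampler runs, the function name is just a placeholder) should help rule out memory management weirdness:

```python
import torch

def report_vram(tag=""):
    # What PyTorch is actually holding on the GPU at this point in the workflow:
    # "allocated" = live tensors, "reserved" = memory cached by the allocator.
    dev = torch.device("cuda:0")
    allocated = torch.cuda.memory_allocated(dev) / 1024**3
    reserved = torch.cuda.memory_reserved(dev) / 1024**3
    free, total = torch.cuda.mem_get_info(dev)
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, "
          f"free={free / 1024**3:.2f} / {total / 1024**3:.2f} GiB")

report_vram("before sampling")
```

If the free amount right before sampling is noticeably lower with the V2 GGUF than with the V1 one, it's the weights taking up the space; if it's about the same, it's the sampling itself.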
Thanks. I tried your suggestions, but no luck. After the text encoder runs, the result should be cached, shouldn't it? Subsequent runs skip past the TE step (as long as it's left untouched).
Check this out. I tried the "fixed" FP8 .safetensors model from Kijai:
https://huggingface.co/Kijai/HunyuanVideo_comfy/blob/main/hunyuan_video_I2V_720_fixed_fp8_e4m3fn.safetensors
This one works fine with V2 guidance! In fact, it uses slightly less VRAM than Kijai's Q8_0 GGUF model (the V1 model). But I'm not sure how this model was created, or whether there's a quality reduction versus the Q8_0 model, which I assume was quantized directly from Tencent's FP16 model.
It seems the problem is specifically the V2 GGUF file taking up too much VRAM, although this isn't a problem with the V1 GGUF, and the two appear to be the same size.
The TE should be offloaded after the first run, yeah, though depending on the VRAM amount it might only get partially unloaded instead of fully. On the default CLIP loader you can right-click -> show advanced and set the load device to cpu on recent versions, iirc. Worth a try for debugging.
FP8 is slightly smaller than Q8_0 (13.2GB vs 14.0GB), so that's kinda expected. It's also usually worse for quality, though it looks like Kijai did a mixed FP8 quant with the more important layers kept in BF16, which should help.
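Roughly speaking, a mixed quant like that just means casting the big linear weights to FP8 while keeping the rest in BF16. A minimal sketch of the idea (purely illustrative, not how Kijai's file was actually made; the filenames and the layer rule are made up):

```python
import torch
from safetensors.torch import load_file, save_file

# Illustrative mixed FP8 conversion: 2D linear weights go to float8_e4m3fn,
# everything else (norms, biases, embeddings, etc.) stays in BF16.
state = load_file("hunyuan_video_720_bf16.safetensors")  # hypothetical BF16 base
mixed = {}
for name, w in state.items():
    keep_bf16 = w.ndim != 2 or "final_layer" in name  # placeholder "important layer" rule
    mixed[name] = w.to(torch.bfloat16) if keep_bf16 else w.to(torch.float8_e4m3fn)
save_file(mixed, "hunyuan_video_720_mixed_fp8_e4m3fn.safetensors")
```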
Looking at the weights, all of these models seem to just use the BF16 model as a base, which makes sense. I don't think there are any differences there.
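(In case it's useful, this is roughly how to peek at the per-tensor dtypes; assuming the standard safetensors and gguf-py packages, exact attribute names may differ slightly between versions:)

```python
from safetensors import safe_open
from gguf import GGUFReader  # gguf-py, the same package llama.cpp uses

# Dump per-tensor dtypes from the FP8 checkpoint to see which layers stayed BF16.
# Needs a recent torch build that supports the float8 dtypes.
with safe_open("hunyuan_video_I2V_720_fixed_fp8_e4m3fn.safetensors",
               framework="pt", device="cpu") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(name, t.dtype, tuple(t.shape))

# Same idea for the GGUF: tensor_type is the per-tensor quant (Q8_0, F16, F32, ...).
for t in GGUFReader("hunyuan-video-i2v-720p-Q8_0.gguf").tensors:
    print(t.name, t.tensor_type, t.shape)
```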
You could try the V1 model from this repo if you want to verify whether it's a V1-vs-V2 issue or a quant issue. I reuploaded the V2 models in place, but you can still select the V1 ones via the dropdown:
https://huggingface.co/city96/HunyuanVideo-I2V-gguf/blob/original/hunyuan-video-i2v-720p-Q8_0.gguf
Yeah, I tried your V1 GGUF model from the "original" branch, and that one works fine. After some messing around and updating things, I was able to get the V2 GGUF model to run a few times (720x720 starts at 94% VRAM usage, then eventually creeps up to 98%). I'm running the display off the iGPU, so there's nothing else using the 4090. I'm using the ComfyBootlegOffload.py script to run the TE on the CPU, which I think is actually helping, so thanks for the tip.
What's more troubling, though, is that the V2 model generates really crappy output using euler/simple. This is the case with both the V2 Q8_0 GGUF and Kijai's FP8 one. The output video is very stuttery, sometimes duplicating frames for an effective FPS of 12, and the brightness is often all over the place. I tried messing with the shift value, but that doesn't seem to help. The Hunyuan docs say you need to raise shift from the default 7.0 if you use fewer than the default 50 steps. But I don't see this issue with the V1 model.
I'm not sure what I'm doing wrong, but it seems like a sampler problem (is it the number of steps or the sampler type?). Output from the V1 model is actually very smooth and consistent at 20 steps, but it's a little overexposed. I based my workflow on this one (the V2 workflow):
https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_video/
I'd like to use the V2 model because the exposure looks more like the input image (when it's consistent and working correctly, anyway).
EDIT:
Changing samplers seems to help. euler_ancestral and dpmpp_2m get rid of the jittery output, but sometimes create crazy camera moves. I'll report back on what works best after some more messing around.
I'm trying this experiment on a 4090 laptop this time, so I'm limited to 16GB of VRAM. I'm running out of memory with the 6-bit models (Q5 works). But I'm also using the FP16 Llama text encoder, offloaded to the CPU using the "advanced" trick you mentioned with the CLIP loader. Would there be any VRAM advantage to using the Q8 text encoder? You mentioned that even though I choose CPU as the load device, it may still get partially loaded into VRAM. Every MB counts...
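For reference, my back-of-the-envelope math on the TE size (assuming the Llama-based text encoder is around 8B parameters, and that Q8_0 works out to roughly 8.5 bits per weight):

```python
params = 8e9  # assumed ~8B parameter Llama-based text encoder
print(f"FP16: {params * 16 / 8 / 1024**3:.1f} GiB")  # ~14.9 GiB
print(f"Q8_0: {params * 8.5 / 8 / 1024**3:.1f} GiB")  # ~7.9 GiB
```

So if any part of it does spill into VRAM despite the CPU setting, the Q8 version should roughly halve that part.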