Video Model Training Notes

Training Step Analysis

What happens in a training step?

A training step processes exactly batch_size samples (not the entire dataset). Here's what happens:

Per Training Step:

  • Processes batch_size videos/samples (configurable, typically 1-8)
  • Uses smart batching that groups videos by resolution dimensions
  • Two data streams: text embeddings + video latents
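
As a rough illustration of those two streams, a single batch might look something like the dictionary below. The field names and tensor shapes are assumptions made up for this sketch, not the project's actual data layout.

```python
import torch

batch_size = 4

# Hypothetical batch with the two data streams mentioned above.
batch = {
    # Precomputed text embeddings for each sample's caption (made-up shape)
    "prompt_embeds": torch.randn(batch_size, 226, 4096),
    # VAE-encoded video latents: (batch, channels, frames, height, width) (made-up shape)
    "latents": torch.randn(batch_size, 16, 13, 60, 90),
}

print({name: tuple(t.shape) for name, t in batch.items()})
```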

Key Points:

  • With 100 videos and batch_size=4: each step processes 4 videos
  • Training runs for a fixed number of steps (not epochs)
  • Dataset loops infinitely, so videos are reused across steps
  • Uses ResolutionSampler to batch videos of similar dimensions together
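
A minimal sketch of resolution-based bucketing in the spirit of ResolutionSampler (the grouping key, field names, and interface here are simplifications for illustration, not the actual sampler API):

```python
from collections import defaultdict

def bucket_by_resolution(samples, batch_size):
    """Yield batches whose samples all share the same (frames, height, width).

    Simplified stand-in for a resolution sampler; leftover partial buckets
    are simply dropped in this sketch.
    """
    buckets = defaultdict(list)
    for sample in samples:
        key = (sample["num_frames"], sample["height"], sample["width"])
        buckets[key].append(sample)
        if len(buckets[key]) == batch_size:
            yield buckets.pop(key)

# Example with made-up metadata records of mixed resolutions
samples = [
    {"num_frames": 49, "height": 480, "width": 720, "path": "a.mp4"},
    {"num_frames": 25, "height": 512, "width": 512, "path": "b.mp4"},
    {"num_frames": 49, "height": 480, "width": 720, "path": "c.mp4"},
    {"num_frames": 25, "height": 512, "width": 512, "path": "d.mp4"},
]
for batch in bucket_by_resolution(samples, batch_size=2):
    print([s["path"] for s in batch])  # ['a.mp4', 'c.mp4'] then ['b.mp4', 'd.mp4']
```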

Training Loop Structure:

  1. Load next batch_size samples from dataset
  2. Group by resolution (spatial + temporal dimensions)
  3. Forward pass through transformer (denoising)
  4. Calculate loss and update weights
  5. Increment step counter

So if you have 100 videos and batch_size=1, step 1 processes video 1, step 2 processes video 2, etc. When it reaches video 100, it loops back to video 1.
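
A bare-bones version of this step-based loop is sketched below. The model, dataloader, and optimizer are placeholders, and a real trainer would also handle gradient accumulation, mixed precision, logging, and checkpointing.

```python
def infinite_loader(dataloader):
    """Loop over the dataset forever so training is bounded by steps, not epochs."""
    while True:
        for batch in dataloader:
            yield batch

def train(model, dataloader, optimizer, max_steps):
    data_iter = infinite_loader(dataloader)
    for step in range(1, max_steps + 1):
        batch = next(data_iter)   # 1-2) next batch_size samples, already resolution-grouped
        loss = model(batch)       # 3) placeholder for the denoising forward pass + loss
        loss.backward()           # 4) compute gradients...
        optimizer.step()          #    ...and update the weights
        optimizer.zero_grad()
        # 5) the loop variable is the step counter; training stops at max_steps
    return model
```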

Avoiding Overfitting

For video model training, a good rule of thumb is to keep the number of times each video is seen during training below roughly 10-50 to avoid overfitting.

Common thresholds:

  • Conservative: <10 times per video (strong generalization)
  • Moderate: 10-50 times per video (balanced)
  • Risky: >100 times per video (likely overfitting)
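
These thresholds can be captured in a small helper; the cutoffs simply mirror the list above and should be read as a rule of thumb, not a hard rule (the 50-100 range is left as a "watch closely" zone).

```python
def overfit_risk(times_seen_per_video: float) -> str:
    """Map how often each video is seen during training to a rough risk label."""
    if times_seen_per_video < 10:
        return "conservative (<10x: strong generalization)"
    if times_seen_per_video <= 50:
        return "moderate (10-50x: balanced)"
    if times_seen_per_video > 100:
        return "risky (>100x: likely overfitting)"
    return "borderline (50-100x: watch validation closely)"

print(overfit_risk(20))   # moderate
print(overfit_risk(150))  # risky
```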

With low learning rates (e.g., 0.00004):

  • A lower LR means each video can potentially be seen more times safely
  • But still better to err on the side of caution

Practical calculation:

  • If training for 10,000 steps with batch_size=1:
    • 100 videos = 100 times each (risky)
    • 500 videos = 20 times each (moderate)
    • 1,000+ videos = 10 times each or fewer (conservative)
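
The arithmetic is simply steps × batch_size ÷ number of videos:

```python
def times_seen_per_video(num_steps: int, batch_size: int, num_videos: int) -> float:
    """Average number of times each video is seen over the whole run."""
    return num_steps * batch_size / num_videos

for num_videos in (100, 500, 1000):
    repeats = times_seen_per_video(num_steps=10_000, batch_size=1, num_videos=num_videos)
    print(f"{num_videos} videos -> seen ~{repeats:.0f} times each")
# 100 videos  -> seen ~100 times each (risky)
# 500 videos  -> seen ~20 times each (moderate)
# 1000 videos -> seen ~10 times each (conservative boundary)
```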

Early stopping indicators:

  • Training loss continues decreasing but validation loss plateaus/increases
  • Generated videos start looking too similar to training examples
  • Loss of diversity in outputs
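
For the first indicator, a generic patience-based check on validation loss is one way to automate the call; this is a standard pattern, not something specific to this project.

```python
class EarlyStopper:
    """Stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=3)
for val_loss in [0.90, 0.80, 0.79, 0.80, 0.81, 0.82]:
    if stopper.should_stop(val_loss):
        print("validation loss plateaued; consider stopping")
        break
```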

With low learning rates, staying under 20-30 times per video should be relatively safe, but <10 times is ideal for strong generalization.