Spaces:

jbilcke-hf
/

VideoModelStudio

Running

App Files Files Community

VideoModelStudio / NOTES.md

jbilcke-hf HF Staff

some ai notes

a9df757 1 day ago

preview code

raw

history blame contribute delete

2.16 kB

	# Video Model Training Notes

	## Training Step Analysis

	### What happens in a training step?

	A training step processes exactly `batch_size` samples (not the entire dataset). Here's what happens:

	Per Training Step:
	- Processes `batch_size` videos/samples (configurable, typically 1-8)
	- Uses smart batching that groups videos by resolution dimensions
	- Two data streams: text embeddings + video latents

	Key Points:
	- With 100 videos and batch_size=4: each step processes 4 videos
	- Training runs for a fixed number of steps (not epochs)
	- Dataset loops infinitely, so videos are reused across steps
	- Uses ResolutionSampler to batch videos of similar dimensions together

	Training Loop Structure:
	1. Load next `batch_size` samples from dataset
	2. Group by resolution (spatial + temporal dimensions)
	3. Forward pass through transformer (denoising)
	4. Calculate loss and update weights
	5. Increment step counter

	So if you have 100 videos and batch_size=1, step 1 processes video 1, step 2 processes video 2, etc. When it reaches video 100, it loops back to video 1.

	## Avoiding Overfitting

	For video model training, a good rule of thumb is to keep each video seen less than 10-50 times during training to avoid overfitting.

	Common thresholds:
	- Conservative: <10 times per video (strong generalization)
	- Moderate: 10-50 times per video (balanced)
	- Risky: >100 times per video (likely overfitting)

	With low learning rates (e.g., 0.00004):
	- Lower LR means you can potentially see videos more times safely
	- But still better to err on the side of caution

	Practical calculation:
	- If training for 10,000 steps with batch_size=1:
	- 100 videos = 100 times each (risky)
	- 500 videos = 20 times each (moderate)
	- 1,000+ videos = <10 times each (conservative)

	Early stopping indicators:
	- Training loss continues decreasing but validation loss plateaus/increases
	- Generated videos start looking too similar to training examples
	- Loss of diversity in outputs

	With low learning rates, staying under 20-30 times per video should be relatively safe, but <10 times is ideal for strong generalization.