
	---
	title: Video-to-Audio-and-Piano
	emoji: 🔊
	colorFrom: blue
	colorTo: indigo
	sdk: gradio
	app_file: app.py
	pinned: false
	---


	## Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization
	## Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks

	## Results

	1. Results of Video-to-Audio Synthesis

	https://github.com/user-attachments/assets/d6761371-8fc2-427c-8b2b-6d2ac22a2db2

	https://github.com/user-attachments/assets/50b33e54-8ba1-4fab-89d3-5a5cc4c22c9a

	2. Results of Video-to-Piano Synthesis

	https://github.com/user-attachments/assets/b6218b94-1d58-4dc5-873a-c3e8eef6cd67

	https://github.com/user-attachments/assets/ebdd1d95-2d9e-4add-b61a-d181f0ae38d0


	## Installation

	1. Create a conda environment

	```bash
	conda create -n v2ap python=3.10
	conda activate v2ap
	```

	2. Install requirements

	```bash
	pip install -r requirements.txt
	```


	## Pretrained models

	The models are available at https://huggingface.co/lshzhm/Video-to-Audio-and-Piano/tree/main.
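
One way to fetch the checkpoints programmatically is with the `huggingface_hub` package. The following is a minimal sketch, not part of this repository: the helper name `download_models` and the target directory `ckpts` are hypothetical, and it assumes `huggingface_hub` is installed (it may or may not be listed in `requirements.txt`). Check the inference scripts for the checkpoint location they actually expect.

```python
from pathlib import Path

# Repository id taken from the model URL above.
REPO_ID = "lshzhm/Video-to-Audio-and-Piano"


def download_models(local_dir: str = "ckpts") -> Path:
    """Download every file in the model repository into local_dir.

    "ckpts" is a hypothetical directory name, not one defined by this
    project; use whatever location the inference scripts expect.
    """
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_models()` once from the repository root should place the files under `ckpts/`; pass a different path if the scripts look elsewhere.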


	## Inference

	1. Video-to-Audio inference

	```bash
	python src/inference_v2a.py
	```

	2. Video-to-Piano inference

	```bash
	python src/inference_v2p.py
	```
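
To run both tasks back to back, the two commands above can be wrapped in a small driver. This is only a sketch: the helper names `build_command` and `run_all` are hypothetical, and it assumes the scripts take no required arguments, as the commands above suggest.

```python
import subprocess
import sys

# Script paths as given in the commands above, relative to the repository root.
SCRIPTS = {
    "v2a": "src/inference_v2a.py",  # Video-to-Audio
    "v2p": "src/inference_v2p.py",  # Video-to-Piano
}


def build_command(task: str) -> list[str]:
    """Return the command line for one task, using the current interpreter."""
    if task not in SCRIPTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(SCRIPTS)}")
    return [sys.executable, SCRIPTS[task]]


def run_all(repo_root: str = ".") -> None:
    """Run both inference scripts in order, stopping at the first failure."""
    for task in SCRIPTS:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(build_command(task), cwd=repo_root, check=True)
```

Using `sys.executable` keeps the child processes inside the active `v2ap` conda environment instead of whichever `python` happens to be first on `PATH`.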

	## Dataset is in progress


	## Metrics


	## Acknowledgement

	- [Audeo](https://github.com/shlizee/Audeo) for video-to-MIDI prediction
	- [E2TTS](https://github.com/lucidrains/e2-tts-pytorch) for the CFM structure and the base E2 implementation
	- [FLAN-T5](https://huggingface.co/google/flan-t5-large) for the FLAN-T5 text encoder
	- [CLIP](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) for the CLIP image encoder
	- [AudioLDM Eval](https://github.com/haoheliu/audioldm_eval) for audio evaluation