# Audeo
## Introduction
This repository contains the code for the paper **"Audeo: Audio Generation for a Silent Performance Video"**, published at NeurIPS 2020 and available [here](https://proceedings.neurips.cc/paper/2020/file/227f6afd3b7f89b96c4bb91f95d50f6d-Paper.pdf). More samples can be found on our [project webpage](http://faculty.washington.edu/shlizee/audeo/) and in our [YouTube video](https://www.youtube.com/watch?v=8rS3VgjG7_c).
[![Alt text](https://img.youtube.com/vi/8rS3VgjG7_c/0.jpg)](https://www.youtube.com/watch?v=8rS3VgjG7_c)
## Abstract
We present a novel system that takes as input video frames of a musician playing the piano and generates the music for that video. The generation of music from visual cues is a challenging problem, and it is not clear whether it is an attainable goal at all. Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events. To achieve the transformation we built a full pipeline named ‘Audeo’ containing three components. We first translate the video frames of the keyboard and the musician's hand movements into a raw mechanical musical symbolic representation, Piano-Roll (Roll), for each video frame, which represents the keys pressed at each time step. We then adapt the Roll to be amenable for audio synthesis by including temporal correlations. This step turns out to be critical for meaningful audio generation. In the last step, we implement Midi synthesizers to generate realistic music. Audeo converts video to audio smoothly and clearly with only a few setup constraints. We evaluate Audeo on piano performance videos collected from YouTube and find that the generated music is of reasonable audio quality and can be successfully recognized with high precision by popular music identification software.
## Data
We use YouTube videos recorded by [Paul Barton](https://www.youtube.com/user/PaulBartonPiano) to evaluate the Audeo pipeline. For **Pseudo Midi Evaluation**, we use 24 videos of Bach Well-Tempered Clavier Book One (WTC B1). The testing set contains the first 3 Prelude and Fugue performances of Bach Well-Tempered Clavier Book Two (WTC B2). The YouTube video IDs can be found [here](https://github.com/shlizee/Audeo/blob/master/Video_Id.md). For **Audio Evaluation**, we use 35 videos from WTC B2 (24 Prelude and Fugue pairs and their 11 variants), 8 videos of WTC B1 variants, and 9 videos from other composers. Since we cannot host the videos due to copyright issues, you need to download them yourself.
All videos are set at a frame rate of 25 fps and an audio sampling rate of 16 kHz. The **Pseudo GT Midi** are obtained via the [Onsets and Frames framework (OF)](https://github.com/magenta/magenta/tree/master/magenta/models/onsets_frames_transcription). We process all videos to keep only the full keyboard and remove all frames that do not contribute to the piano performance (e.g., logos, black screens). The **cropped piano coordinates** can be found [here](https://github.com/shlizee/Audeo/blob/master/piano_coords.py) (the order is the same as in the **Video_Id** file). We trim the initial silent section up to the first frame in which the first key is pressed, to align the video, Pseudo GT Midi, and audio. All silent frames inside each performance are kept.
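For reference, a minimal sketch of this preprocessing is shown below, assuming OpenCV (not in the requirements list below); the crop box is a placeholder for the real per-video values in **piano_coords.py**, and `first_press_frame` stands in for the trimming point described above:

```python
import cv2

PIANO_COORDS = (1, 247, 1279, 353)  # hypothetical (x1, y1, x2, y2) crop box; see piano_coords.py

def extract_keyboard_frames(video_path, first_press_frame=0):
    """Yield cropped keyboard frames from a 25 fps video, starting at the first key press."""
    x1, y1, x2, y2 = PIANO_COORDS
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx >= first_press_frame:        # trim the silent intro
            yield frame[y1:y2, x1:x2]       # keep only the keyboard region
        idx += 1
    cap.release()
```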
For your convenience, we provide the following folders/files in [Google Drive](https://drive.google.com/drive/folders/1w9wsZM-tPPUVqwdpsefEkrDgkN3kfg7G?usp=sharing):
- **input_images**: examples of how the image data should look.
- **labels**: training and testing labels for Video2Roll Net. Each folder contains one **pkl** file per video. The labels are dictionaries where the **key** is the **frame number** and the **value** is an 88-dim vector. See **Video2Roll_dataset.py** for more details, and the loading sketch after this list.
- **OF_midi_files**: the original Pseudo ground truth midi files obtained from **Onsets and Frames Framework**.
- **midi**: we process the Pseudo GT Midi files into 2D matrices (piano keys x time) down-sampled to 25 fps. For each video, we then divide the matrix into 2-second (50-frame) segments; for example, **253-303.npz** contains the 2D matrix from frame 253 to frame 302 (this processing is reproduced in the sketch after the list).
- **estimate_Roll**: the **Roll** predictions obtained from **Video2Roll Net**. Same format as the **midi**. You can directly use them for training **Roll2Midi Net**.
- **Roll2Midi_results**: the **Midi** predictions obtained from **Roll2Midi Net**. Same format as **midi** and **estimate_Roll**. Ready for **Midi Synth**.
- **Midi_Synth**: synthesized audio from **Roll2Midi_results**.
- **Video2Roll_models**: contains the pre-trained **Video2RollNet.pth**.
- **Roll2Midi_models**: contains the pre-trained **Roll2Midi Net**.
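As a reference for the **labels** and **midi** formats above, here is a minimal sketch assuming **pretty_midi** and placeholder file names; the npz key name (`midi`) and the binarization are assumptions, so check the dataset scripts for the exact format they expect:

```python
import pickle

import numpy as np
import pretty_midi

FPS = 25        # video frame rate
SEG = 2 * FPS   # 2-second segments = 50 frames

# Load one per-video label dictionary: frame number -> 88-dim key-press vector.
with open("labels/train/example.pkl", "rb") as f:   # placeholder path
    labels = pickle.load(f)

# Rebuild a Pseudo GT Piano-Roll from an OF midi file and slice it into
# 50-frame segments named like the provided "midi" folder
# (e.g. 253-303.npz holds frames 253..302).
pm = pretty_midi.PrettyMIDI("OF_midi_files/example.mid")  # placeholder path
roll = pm.get_piano_roll(fs=FPS)                  # (128, T) matrix at 25 fps
roll = (roll[21:109] > 0).astype(np.uint8)        # keep the 88 piano keys, binarize
for start in range(0, roll.shape[1] - SEG + 1, SEG):  # full segments only
    np.savez(f"midi/{start}-{start + SEG}.npz", midi=roll[:, start:start + SEG])
```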
## How to Use
- Video2Roll Net
  1. Please check **Video2Roll_dataset.py** and make sure your data matches the expected formats.
2. Run **Video2Roll_train.py** for training.
3. Run **Video2Roll_evaluate.py** for evaluation.
4. Run **Video2Roll_inference.py** to generate **Roll** predictions.
- Roll2Midi Net
1. Run **Roll2Midi_train.py** for training.
2. Run **Roll2Midi_evaluate.py** for evaluation.
  3. Run **Roll2Midi_inference.py** to generate **Midi** predictions.
- Midi Synth
  1. Run **Midi_synth.py** to synthesize audio with **FluidSynth**, as sketched below.
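The following is a minimal sketch of this last step, assuming a binary (88 x T) Midi prediction at 25 fps and **pretty_midi** with the optional pyfluidsynth backend. The soundfont path, npz key name, and fixed velocity are assumptions; **Midi_synth.py** is the authoritative implementation:

```python
import numpy as np
import pretty_midi
from scipy.io import wavfile

FPS, SR = 25, 16000   # video frame rate / audio sampling rate

def roll_to_midi(roll, velocity=100):
    """Turn a binary (88, T) piano roll at FPS into a pretty_midi object."""
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)   # acoustic grand piano
    # Zero-pad both ends so onsets/offsets at the boundaries are detected.
    padded = np.pad((roll > 0).astype(np.int8), ((0, 0), (1, 1)))
    for key in range(88):
        diff = np.diff(padded[key])             # +1 at note onsets, -1 at offsets
        for on, off in zip(np.where(diff == 1)[0], np.where(diff == -1)[0]):
            piano.notes.append(pretty_midi.Note(
                velocity=velocity, pitch=key + 21,   # key 0 -> MIDI note A0
                start=on / FPS, end=off / FPS))
    pm.instruments.append(piano)
    return pm

roll = np.load("Roll2Midi_results/253-303.npz")["midi"]    # key name assumed
audio = roll_to_midi(roll).fluidsynth(fs=SR, sf2_path="piano.sf2")
audio /= max(np.abs(audio).max(), 1e-8)                    # normalize to [-1, 1]
wavfile.write("253-303.wav", SR, (audio * 32767).astype(np.int16))
```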
## Requirements
- PyTorch >= 1.6
- Python 3
- numpy 1.19
- scikit-learn 0.22.1
- librosa 0.7.1
- pretty-midi 0.2.8
## Citation
Please cite ["Audeo: Audio Generation for a Silent Performance Video"](https://proceedings.neurips.cc/paper/2020/file/227f6afd3b7f89b96c4bb91f95d50f6d-Paper.pdf) when you use this code:
```
@article{su2020audeo,
title={Audeo: Audio generation for a silent performance video},
author={Su, Kun and Liu, Xiulong and Shlizerman, Eli},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}
```