Generating sound effects for videos often requires artistic design that goes far beyond simply mirroring reality. Sound designers need flexible control options to create the desired atmosphere. A novel model called MultiFoley addresses this challenge and enables video-guided generation of Foley sounds using multimodal control via text, audio, and video.
Starting from a silent video and a text prompt, MultiFoley lets users generate clean sounds (e.g., skateboard wheels without wind noise) or imaginative ones (e.g., a lion's roar that sounds like a cat's meow). Reference audio from sound effect libraries or a partial video soundtrack can also serve as conditioning. A key feature of MultiFoley is its joint training on internet videos with low-quality audio and professional sound effect recordings, which enables high-quality, full-bandwidth (48 kHz) audio generation.
MultiFoley supports a wide range of applications. Through text input, users can specify the audio content, whether it matches the visual content or deliberately diverges from it; for example, unwanted elements such as wind noise can be removed, or one sound can be swapped for another. Beyond text, the system also accepts several kinds of audio and audiovisual conditioning: users can generate a desired sound effect from a reference clip in a sound effect library, or extend the soundtrack of a partially scored video.
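These conditioning modes can be pictured as a single interface that takes a silent video plus any combination of text, reference audio, and partial audio. The sketch below illustrates that with a toy conditioning spec; the class name, field names, and tensor shapes are illustrative assumptions, not the authors' released API.

```python
# Illustrative sketch (not the authors' released API): the three conditioning
# modes described above, expressed as a simple spec a MultiFoley-style
# sampler could consume.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class FoleyConditions:
    video_frames: torch.Tensor                       # silent video, (T, C, H, W)
    text: Optional[str] = None                       # text prompt, may contradict the visuals
    reference_audio: Optional[torch.Tensor] = None   # clip from an SFX library, 48 kHz
    partial_audio: Optional[torch.Tensor] = None     # first seconds of a soundtrack to extend

frames = torch.zeros(240, 3, 224, 224)  # dummy 10 s clip at 24 fps

# 1) Text-driven sound design: replace wind noise with a clean prompt.
cond_text = FoleyConditions(frames, text="clean skateboard wheels, no wind")

# 2) Audio-conditioned generation from a sound-effect library clip.
cond_ref = FoleyConditions(frames, reference_audio=torch.zeros(48_000 * 2))

# 3) Soundtrack extension: keep the first 2 s of audio and generate the rest.
cond_ext = FoleyConditions(frames, partial_audio=torch.zeros(48_000 * 2))
```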
One of the biggest challenges in developing MultiFoley was that internet video soundtracks are often poorly matched to the visual content and of low quality (e.g., noisy audio and limited bandwidth). To address this, MultiFoley was jointly trained on high-quality sound effect libraries (audio-text pairs) and internet videos, with text supervision in both cases. This approach enables the model to generate full-bandwidth audio that meets professional standards and allows for precise text-based customization.
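One plausible way to picture this joint training is a sampler that mixes (audio, caption) pairs from a sound effect library with (video, audio, caption) triples from internet clips, marking items that carry no video. The sketch below illustrates the idea; the mixing ratio, tensor shapes, and toy data are assumptions, not details from the paper.

```python
# Minimal sketch of joint training on SFX libraries and internet videos,
# assuming SFX items are (audio, caption) pairs without video and web items
# are (video, audio, caption) triples. Ratios and shapes are illustrative.
import random
import torch

def sample_training_item(sfx_items, video_items, p_sfx=0.5):
    """Draw one item and return a uniform (video, audio, caption, has_video) tuple."""
    if random.random() < p_sfx:
        audio, caption = random.choice(sfx_items)        # high-quality 48 kHz audio, no video
        video = torch.zeros(240, 3, 224, 224)            # placeholder frames
        has_video = False
    else:
        video, audio, caption = random.choice(video_items)  # in-the-wild clip, lower audio quality
        has_video = True
    return video, audio, caption, has_video

# Toy data: one SFX pair and one internet clip.
sfx = [(torch.zeros(48_000 * 8), "a lion roaring")]
web = [(torch.zeros(240, 3, 224, 224), torch.zeros(48_000 * 8), "a cat on a skateboard")]
print(sample_training_item(sfx, web)[2])
```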
Previous approaches to Foley generation have mainly framed the problem as a video-to-audio generation task. However, this framing ties the audio content to the visuals and does not offer the control designers need, since the sounds they create often deviate from real life in many ways. Existing systems provide only limited control options (e.g., conditioning on example videos or text) and struggle with audio quality and synchronization.
MultiFoley, by contrast, integrates multimodal controls. Through joint training across audio, video, and text modalities, it supports flexible control and a range of Foley applications. Automated evaluations and user studies show that MultiFoley generates synchronized, high-quality sounds across varied conditional inputs and outperforms existing methods.
MultiFoley combines a diffusion transformer, a high-quality audio autoencoder, and a frozen video encoder for audio-video synchronization, and it is trained with a novel multi-conditional strategy that enables flexible downstream tasks such as audio extension and text-driven sound design.
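A common way to realize a multi-conditional training strategy like the one described is classifier-free-guidance-style condition dropout: during training, each condition (text, audio, video) is independently replaced with a null embedding, so the model learns to work with any subset of conditions at inference time. The sketch below shows that pattern under this assumption; the dropout probability, embedding shapes, and the use of a zero null embedding are illustrative choices, not details from the paper.

```python
# Sketch of condition dropout for multi-conditional training: each modality's
# embedding is independently zeroed out with some probability, so any subset
# of text/audio/video conditions is seen during training.
import torch

def drop_conditions(text_emb, audio_emb, video_emb, p_drop=0.1):
    """Independently replace each condition with a zero 'null' embedding."""
    batch = text_emb.shape[0]
    def maybe_drop(emb):
        keep = (torch.rand(batch, 1, 1, device=emb.device) > p_drop).float()
        return emb * keep  # zero embedding stands in for the null condition
    return maybe_drop(text_emb), maybe_drop(audio_emb), maybe_drop(video_emb)

# Toy embeddings: batch of 4, sequence length 16, width 512.
t = torch.randn(4, 16, 512)
a = torch.randn(4, 16, 512)
v = torch.randn(4, 16, 512)
t_cond, a_cond, v_cond = drop_conditions(t, a, v)
```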
MultiFoley represents a significant advance in the field of video-guided audio generation. The multimodal control and high audio quality open up new possibilities for creative sound design and considerably simplify the work of sound designers. The ability to precisely adapt sounds to videos while achieving both realistic and imaginative results could fundamentally change the way we experience audio in videos.