MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization

Jianxuan Yang1*, Xiaoran Yang1,2*, Lipan Zhang1, Xinyue Guo1, Zhao Wang1, Gongping Huang2
1MiLM Plus, Xiaomi Inc., China
2School of Electronic Information, Wuhan University, Wuhan, China
*Equal contribution, †Corresponding author

Abstract

Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenes containing multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods have difficulty precisely aligning intricate semantic information with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality, and therefore fails to improve integrated generation quality in cluttered multi-event scenes. To address these limitations, this study proposes MultiSoundGen, a novel V2A framework that introduces direct preference optimization (DPO) into the V2A domain and leverages audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations. First, SlowFast Contrastive AVP (SF-CAVP) is a pioneering AVP model with a unified dual-stream architecture; it explicitly aligns the core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity. Second, we integrate DPO into the V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO), which uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains in distribution matching, audio quality, semantic alignment, and temporal synchronization.
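For readers unfamiliar with DPO, the sketch below shows the standard pairwise DPO objective that AVP-RPO builds on. It is a minimal illustration, not the paper's exact AVP-RPO formulation: in MultiSoundGen the preference pairs would be ranked by the SF-CAVP reward model, whereas here the log-likelihoods are arbitrary placeholder values.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective for one preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred ("winner")
    and dispreferred ("loser") samples; ref_* are the same quantities
    under the frozen reference model. beta scales the implicit reward.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# The loss shrinks as the policy widens the preference margin
# relative to the reference model (placeholder log-likelihoods):
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))  # margin > 0, loss < log 2
print(dpo_loss(-2.0, -1.0, -1.5, -1.5))  # margin < 0, loss > log 2
```

In practice the policy and reference models would be the V2A generator before and after preference fine-tuning, and the loss would be averaged over a batch of reward-ranked audio pairs.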

Abstract Figure
MultiSoundGen: a novel V2A framework for multi-event scenarios.

Samples

Samples for multi-event V2A

Demo 1: Movie scene—an intense motorcycle chase

Ground Truth
MultiSoundGen
MMAudio
FoleyCrafter
Seeing&Hearing
V_AURA
Comparison of V2A results for Demo 1. MultiSoundGen achieves the best audio-visual alignment.

Demo 2: Unconventional two-person music performance

Ground Truth
MultiSoundGen

Demo 3: Two dogs making different types of barking sounds

Ground Truth
MultiSoundGen

Samples for general scenario V2A

Animal

Lion
Rooster
Puppy

Domestic sounds

Typewriter
Leaf blower
Sewing Machine

Human sounds

Tap dance
Jump Rope
Chew

Music

Double Bass
Conga Drums
Snare Drum

Natural sounds

Sea Wave
Rain
Fire

Tools

Chop
Woodturn
Sharpen

Vehicle

Helicopter
Car
Railway

Acknowledgements

Most videos used on this demo page are from the VGGSound dataset.

Some videos were downloaded from the internet. These videos are used solely for demonstration purposes, and we do not claim any copyright. If any content infringes upon your rights, please contact us and we will remove it immediately.