arXiv Papers on Text-to-Audio Generation

Abstract:
We present a real‑time musical interface that converts natural‑language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct parameter adjustments ‑ stepping brightness down, switching a rhythm style ‑ each producing a predictable, audible shift without re‑prompting. Where GPU‑bound text‑to‑audio systems synthesize monolithic waveforms, our instrument generates human‑readable configurations over a categorical schema, enabling fine‑grained performer control; most valid combinations are designed to sound musically coherent. Three interchangeable backends ‑ embedding retrieval for sub‑second CPU‑only use, hosted LLMs via API, and a fine‑tuned 270M local model ‑ all emit the same schema. A live generator architecture continuously emits audio while resolving new instructions in the background, crossfading seamlessly when ready; even when an LLM takes 5‑12 seconds to respond, the audience hears uninterrupted sound ‑ reframing text‑to‑music as an ongoing performable stream rather than a one‑shot generation. We evaluate text‑audio semantic alignment using LAION‑CLAP on held‑out prompts as a technical proxy, finding that retrieval‑based configuration outperforms random valid configurations on this metric, while noting that LAION‑CLAP also informed retrieval‑map construction. We report performance observations, informal listener feedback, and release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.
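
The crossfade step mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes an equal-power (sine/cosine) fade between two equal-length sample buffers; the function name and the choice of fade curve are illustrative assumptions, not details taken from the paper.

```python
import math

def equal_power_crossfade(old, new):
    """Blend two equal-length sample lists, fading `old` out and `new` in."""
    if len(old) != len(new):
        raise ValueError("buffers must have the same length")
    n = len(old)
    out = []
    for i in range(n):
        # Position in the fade, from 0.0 (all old) to 1.0 (all new).
        x = i / (n - 1) if n > 1 else 1.0
        gain_old = math.cos(x * math.pi / 2)
        gain_new = math.sin(x * math.pi / 2)
        out.append(gain_old * old[i] + gain_new * new[i])
    return out

# Hypothetical usage: fade from a constant stream into silence over 5 samples.
mixed = equal_power_crossfade([1.0] * 5, [0.0] * 5)
```

In a live setting the old generator would keep producing `old` until the new configuration is resolved, and only then would the fade window be rendered.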

Abstract:
Existing multi‑speaker dialogue systems bind speakers to utterances through structured supervision: per‑turn tags, multi‑stream transcriptions, or learnable speaker embeddings. These systems operate within speech‑only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text‑to‑audio flow‑matching foundation model, pretrained on large‑scale in‑the‑wild data, directly on multiple reference voices and a free‑form natural language prompt that describes an entire multi‑speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non‑studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi‑speaker control without any per‑turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity‑aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high‑noise‑biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2‑Dialogue benchmark, showing that it outperforms existing multi‑speaker systems on speaker‑binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general‑purpose audio model conditioned on a free‑form scene description, rather than passing structured dialog scripts through a speech‑only pipeline.
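
The abstract does not specify the high-noise-biased timestep distribution, so the following Python sketch shows only one possible form. It assumes the convention that t = 1 is pure noise and uses a power transform of a uniform sample; `sample_timestep` and `bias` are hypothetical names.

```python
import random

def sample_timestep(rng, bias=3.0):
    """Draw t in [0, 1]; a larger `bias` pushes mass toward t = 1 (assumed pure noise)."""
    u = rng.random()
    # For u ~ Uniform(0, 1), u ** (1 / bias) has density bias * t ** (bias - 1),
    # which increases with t whenever bias > 1.
    return u ** (1.0 / bias)

rng = random.Random(0)
samples = [sample_timestep(rng) for _ in range(10_000)]
mean_t = sum(samples) / len(samples)  # theoretical mean is bias / (bias + 1) = 0.75
```

With most training steps drawn at high noise, the noisy target carries little acoustic information about the speaker, which is the intuition the abstract gives for why the model must fall back on the text prompt.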

Abstract:
We present FoleyGenEx, a unified video‑to‑audio (VTA) framework integrating multi‑modal control, frame‑level temporal alignment, and fine‑grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi‑modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio‑controlled VTA and Foley extension, a multi‑modal dynamic masking strategy preserving training synchronization, and an adverb‑based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.

Abstract:
Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large‑scale, high‑quality training data, and 3) the prohibitive inference cost of multi‑step diffusion sampling. To address these challenges, we propose AudioX‑Turbo, a unified and efficient framework for anything‑to‑audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX‑Turbo follows a teacher‑student paradigm. The teacher AudioX‑Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high‑fidelity synthesis, and is then distilled into the few‑step student AudioX‑Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion‑based discriminator for high‑quality few‑step generation. To support the training of AudioX‑Turbo, we construct a large‑scale, high‑quality dataset, IF‑caps‑Pro, comprising approximately 9.2M samples curated through a two‑stage data collection and annotation pipeline. We benchmark AudioX‑Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text‑to‑audio and text‑to‑music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi‑step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction‑following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX‑Turbo/.

Abstract:
The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open‑source models remains limited by the scarcity of high‑quality training data. To bridge this gap, we introduce CineDance‑1M, a large‑scale, open research Text‑to‑Audio‑Video (T2AV) dataset designed specifically for multi‑shot, long‑form joint audio‑video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three‑stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film‑theory‑inspired narrative parsing, and iii) hierarchical dual‑modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six‑dimensional, human‑aligned metric system tailored for complex narrative audio‑video evaluation. Furthermore, we adapt LTX‑2.3 into CineDance, which demonstrates exceptional single‑modality quality alongside precise audio‑video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance‑1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi‑shot, long‑form joint audio‑video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

Abstract:
In recent years, audio generation has made significant progress in tasks such as text‑to‑speech (TTS), text‑to‑audio (TTA) and text‑to‑music (TTM). However, generating long‑form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post‑production. In this work, we introduce Audio‑Oscar, a multi‑agent framework for generating audio from complex descriptions. Audio‑Oscar coordinates a set of specialist agents, each responsible for a different aspect of the audio scene, including character modeling and voice design, speech generation, fine‑grained timeline planning, model selection, non‑speech generation, and audio post‑production. Audio‑Oscar further incorporates feedback‑driven refinement. In addition, to address the lack of suitable benchmarks for evaluating audio generation from complex audio scene descriptions, we construct ASG‑Bench, an Audio Scene Generation Benchmark containing both scene descriptions paired with reference audio and text‑only scene descriptions. Each scene is annotated with target audio events and temporal statements to evaluate whether the generated audio faithfully realizes the required scene content and temporal structure. Experimental results show that Audio‑Oscar can effectively generate audio that matches complex scene descriptions. Project samples are available at https://audiooscar.github.io/. Our code is available at https://github.com/ziye26/Audio‑Oscar.

Abstract:
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high‑level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high‑dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low‑dimensional audio tokenizer for cross‑domain audio understanding and generation. Motivated by the observation that 1280‑dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by the proposed time‑relation loss for temporal feature consistency. We further design a dual‑level semantic supervision method that leverages both high‑ and low‑dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low‑dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low‑dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.

Abstract:
Audio generation has long been fragmented, with speech, music, and sound effects produced by domain‑specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine‑grained supervision for real‑world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed‑audio scenes from text. Dasheng AudioGen introduces structured multi‑view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine‑grained control over audio layers. Furthermore, we employ a high‑dimensional unified semantic‑acoustic representation as the shared latent space. It injects semantic priors that facilitate cross‑modal training convergence, while its high‑dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow‑matching DiT achieves high‑quality end‑to‑end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real‑world recordings in mixed‑audio categories, while remaining competitive with specialized models in single‑type generation tasks. Demos are available at https://nieeim.github.io/Dasheng‑AudioGen‑Web/.

Abstract:
We investigate Counterfactual Video Foley Generation, which aims to adopt a sound‑source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text‑to‑Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present ConterFlow, an inference‑time dual‑phase sampling scheme for pretrained flow‑matching VT2A models. Phase 1 builds a video‑derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. ConterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state‑of‑the‑art baselines. To evaluate replacement quality, we propose a metric leveraging a text‑audio co‑embedding space to measure both target‑prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin‑lee.github.io/counterflow‑demo/

Abstract:
Modern audio generation predominantly relies on latent‑space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high‑fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high‑dimensional and low‑energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x‑prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high‑quality video‑text‑audio triplets, allowing the model to learn fine‑grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video‑to‑audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text‑to‑audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent‑based methods. Our work demonstrates that intermediate compression is not a prerequisite for high‑quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
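
To make the "waveform patchify" idea concrete, here is a minimal Python sketch that reshapes a 1D sample list into a grid of fixed-size patches and back. The zero-padding of the tail, the patch size, and the function names are assumptions for illustration; the abstract does not give WavFlow's actual layout, and amplitude lifting is not covered here.

```python
def patchify(waveform, patch_size):
    """Split a 1D list of samples into rows of `patch_size`, zero-padding the tail."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    pad = (-len(waveform)) % patch_size
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]

def unpatchify(grid, length):
    """Flatten the grid back to 1D and drop the padding."""
    flat = [sample for row in grid for sample in row]
    return flat[:length]

# Hypothetical usage: 10 samples become a 3 x 4 grid (two zeros of padding).
wave = [0.1 * i for i in range(10)]
grid = patchify(wave, 4)
restored = unpatchify(grid, len(wave))
```

Each row of the grid can then be treated as one token, so the sequence length shrinks by a factor of `patch_size` while no intermediate codec is involved.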

Abstract:
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable‑length audio generation and editing. Since our models can generate several minutes of audio, variable‑length generation is key to avoiding the cost of producing full‑length outputs for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic‑acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion‑based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post‑training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer‑grade hardware, together with their training and inference pipeline.

Abstract:
Current methods for creating drum loop audio in digital music production, such as using one‑shot samples or resampling, often demand non‑trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic‑to‑audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing "Break‑the‑Beat!", a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine‑tuning a pre‑trained text‑to‑audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target‑reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high‑quality drum audio that follows high‑resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break‑the‑beat/

Abstract:
In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2‑Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank‑1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism: a residual Riemannian ground‑metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four‑axis protocol, coupling‑only comparisons at ε = 0.05 show that Sinkhorn's rank‑1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio‑quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete transport plan, OTAD yields per‑sample diagnostics with AUROC ≥ 0.86, a capability that scalar‑ or kernel‑aggregated metrics structurally lack.
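
As a rough illustration of the entropic Sinkhorn coupling named in this abstract, the sketch below computes a transport plan between two small sets of 1D "embeddings" with uniform weights, then derives a per-sample cost from the rows of the plan. The toy data, the squared-difference cost, and the `per_sample_cost` formula are assumptions made for illustration; they are not OTAD's actual cost adapter or diagnostic. Only ε = 0.05 is taken from the abstract.

```python
import math

def sinkhorn_plan(cost, eps=0.05, iters=200):
    """Entropic OT plan between uniform marginals for a cost matrix (list of rows)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def per_sample_cost(plan, cost):
    """Average transport cost paid by each row sample (illustrative diagnostic)."""
    return [sum(p * c for p, c in zip(prow, crow)) / sum(prow)
            for prow, crow in zip(plan, cost)]

generated = [0.0, 1.0]   # toy 1D embeddings, purely hypothetical
reference = [0.1, 1.4]
cost = [[(g - r) ** 2 for r in reference] for g in generated]
plan = sinkhorn_plan(cost)
costs = per_sample_cost(plan, cost)  # second sample sits farther from its match
```

Plain-domain Sinkhorn like this converges slowly and can underflow when ε is small relative to the cost scale; practical implementations usually iterate in the log domain.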

Abstract:
MiniMind‑O is an open 0.1B‑scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text‑to‑audio, image‑to‑text, and audio‑to‑audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four‑layer Talker made from MiniMind blocks. Frozen SenseVoice‑Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality‑placeholder positions. The Talker reads a middle‑layer Thinker state together with an autoregressive eight‑layer Mimi‑code buffer. Speaker control is handled by a dedicated speaker token, right‑aligned reference codec prompts, and precomputed CAM++ speaker embeddings, so voice conditioning remains part of the audio‑code context rather than a separate TTS module. With a 768‑dimensional Talker, the dense and MoE variants reach average CERs of 0.0897 and 0.0900 in Thinker‑Talker consistency evaluation, with overall voice‑cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale‑critical design choices for small omni models: middle‑layer semantic bridging, a released multimodal sequence format, and a parameter‑efficient eight‑codebook interface.
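
For readers unfamiliar with the CER figures quoted above, the sketch below shows how a character error rate is commonly computed: Levenshtein edit distance divided by reference length. Any text normalization the paper applies before scoring is not stated in the abstract, so none is assumed here; the example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete a reference char
                            curr[j - 1] + 1,          # insert a hypothesis char
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits needed to turn `hyp` into `ref`, per reference char."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical usage: one substituted character out of eleven.
score = cer("hello world", "hallo world")
```

On this scale, the reported values of roughly 0.09 correspond to about nine character edits per hundred reference characters.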

Abstract:
Generative audio modeling has largely been fragmented into specialized tasks, text‑to‑speech (TTS), text‑to‑music (TTM), and text‑to‑audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow‑matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference‑free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme‑driven Multimodal Diffusion Transformer (MM‑DiT). Coupled with a multi‑stage curriculum learning strategy, this approach effectively mitigates cross‑modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state‑of‑the‑art performance in instruction‑based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single‑task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.

Abstract:
Recent advances in video‑to‑audio (V2A) generation enable high‑quality audio synthesis from visual content, yet achieving robust and fine‑grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual‑text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio‑temporal audio‑visual encoder to improve alignment and textual controllability. We further propose temporal‑timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality‑robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound‑TVC, a benchmark for evaluating textual controllability under varying degrees of visual‑text conflict. Extensive experiments demonstrate state‑of‑the‑art performance across multiple V2A tasks, including text‑guided, text‑controlled, and audio‑controlled generation. ControlFoley achieves superior controllability under cross‑modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx‑research.github.io/ControlFoley/.
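
Random modality dropout, as named in this abstract, can be sketched in a few lines of Python. The version below drops each conditioning input independently with a fixed probability; the dictionary keys, the use of `None` as the "dropped" marker, and the independence assumption are illustrative choices, not details confirmed by the abstract.

```python
import random

def drop_modalities(inputs, p_drop, rng):
    """Independently replace each conditioning input with None with probability p_drop."""
    return {name: None if rng.random() < p_drop else value
            for name, value in inputs.items()}

rng = random.Random(0)
example = {
    "video": "video_features",            # placeholders standing in for tensors
    "text": "a dog barking",
    "reference_audio": "audio_features",
}
kept_all = drop_modalities(example, p_drop=0.0, rng=rng)
dropped_all = drop_modalities(example, p_drop=1.0, rng=rng)
```

Training with some inputs missing is what lets a single model serve text-guided, text-controlled, and audio-controlled generation at inference time.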

Abstract:
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio‑Omni, the first end‑to‑end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi‑modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high‑level reasoning with a trainable Diffusion Transformer for high‑fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large‑scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio‑Omni achieves state‑of‑the‑art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio‑Omni exhibits remarkable inherited capabilities, including knowledge‑augmented reasoning generation, in‑context generation, and zero‑shot cross‑lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio‑Omni.

Abstract:
Music accompaniment generation aims to automatically produce instrumental accompaniments that are rhythmically, harmonically, and timbrally coherent with a given vocal input, with broad applications in personalized music creation, arrangement assistance, and music education. Existing approaches, primarily operating in the symbolic domain or relying on single‑stage audio generation frameworks, commonly suffer from insufficient high‑level semantic structure modeling, limited acoustic detail reconstruction, and weak conditional controllability. To address these limitations, this paper proposes HAFM, a Hierarchical Autoregressive Foundation Model for vocal‑conditioned music accompaniment generation. The model employs a dual‑rate tokenization strategy in which 50 Hz HuBERT semantic tokens capture high‑level musical structure and 75 Hz EnCodec acoustic tokens encode fine‑grained acoustic content, enabling explicit disentanglement of semantic and acoustic representations. Building on this foundation, a three‑stage cascaded generation framework is designed to progressively generate semantic tokens, coarse acoustic tokens, and fine acoustic tokens, refining the accompaniment from global structure to local detail. Objective evaluation on the MUSDB18 dataset demonstrates that the full three‑stage model achieves a Fréchet Audio Distance (FAD) score of 1.71, representing an 18.6% relative improvement over the two‑stage baseline (FAD = 2.10). Subjective listening tests show that the generated accompaniments achieve a 51.5% preference rate against ground‑truth accompaniments in head‑to‑head comparisons, and substantially outperform the random baseline in terms of rhythmic alignment, harmonic compatibility, and overall musical coherence. The source code and demo are available at https://github.com/HackerHyper/HAFM.git.
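
The numbers in this abstract can be checked with a short calculation. The sketch below recomputes the relative FAD improvement from the two reported scores and shows how many tokens per stream the two frame rates imply; the 10-second clip length is a hypothetical example, and the count is per codebook (EnCodec emits several codebooks per frame).

```python
def relative_improvement(baseline, improved):
    """Percentage reduction of a lower-is-better score relative to the baseline."""
    return (baseline - improved) / baseline * 100.0

def tokens_for(duration_s, rate_hz):
    """Token count per stream for a clip of the given duration."""
    return round(duration_s * rate_hz)

fad_gain = relative_improvement(2.10, 1.71)   # about 18.6 percent
semantic_tokens = tokens_for(10.0, 50)        # 50 Hz HuBERT tokens
acoustic_tokens = tokens_for(10.0, 75)        # 75 Hz EnCodec frames, per codebook
```

The 1.5x gap between the two rates is why the stages cannot share a one-to-one token alignment and must instead condition across sequences of different lengths.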

Abstract:
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on‑screen and off‑screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video‑conditioned audio generation models typically focus on producing on‑screen environmental sounds that correspond to visible sounding events, neglecting off‑screen auditory events. Recent holistic joint text‑video‑to‑audio generation models aim to produce auditory scenes with both on‑ and off‑screen sound, but they are limited to non‑speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow‑matching‑based diffusion framework jointly conditioned on video and text. It features a TriAttn‑DiT architecture that performs three cross‑attention operations to process on‑screen environmental sound, off‑screen environmental sound, and speech conditions simultaneously, with a Mixture‑of‑Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen‑Bench, a new benchmark with over one thousand samples covering three representative on/off‑screen speech‑environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state‑of‑the‑art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
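
One simple way to picture a gate that "balances contributions" of three cross-attention branches is a softmax-weighted sum of their outputs. The Python sketch below shows that generic pattern with toy vectors; the gate logits are fixed by hand here, whereas in a trained model they would come from a learned network, and nothing in this sketch is taken from the TriAttn‑DiT implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(branch_outputs, gate_logits):
    """Weighted sum of equally sized branch vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(branch_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, branch_outputs))
            for d in range(dim)]

on_screen = [1.0, 0.0]    # toy branch outputs, purely hypothetical
off_screen = [0.0, 1.0]
speech = [1.0, 1.0]
fused = gated_fusion([on_screen, off_screen, speech], gate_logits=[0.0, 0.0, 0.0])
```

With equal logits each branch receives weight 1/3; raising one logit shifts the fused output toward that branch.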

Abstract:
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Optimized for sound effects, the release provides (1) a high‑quality audio encoder/decoder model and (2) a text‑audio alignment model for conditioning, together with (3) text‑to‑audio and (4) video‑to‑audio generative models. Distilled text‑to‑audio and video‑to‑audio models are also included in the release, allowing for low‑resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio‑Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.

Abstract:
Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX‑2, a joint audio‑visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image‑based in‑context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth‑ and pose‑guided generation, inpainting, and outpainting, and show competitive results on camera control and audio‑visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially‑aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio‑visual controls for a joint generation model. Our method is both compute‑ and data‑efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
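
Since each control modality here is trained as a separate LoRA, a minimal sketch of the generic LoRA forward pass may help: the frozen weight W is augmented by a low-rank update scaled by alpha / rank. The matrices below are tiny hand-picked examples, and the sketch covers only the standard LoRA formulation, not AVControl's parallel-canvas mechanism.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x) with frozen W and trainable A, B."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
B_zero = [[0.0], [0.0]]        # common init: B = 0, so the adapter starts as a no-op
B_trained = [[0.5], [-0.5]]    # hypothetical values after training
x = [2.0, 3.0]
y_init = lora_forward(W, A, B_zero, x, alpha=1.0, rank=1)
y_tuned = lora_forward(W, A, B_trained, x, alpha=1.0, rank=1)
```

Because only A and B are trained, adapters for different modalities can be stored and swapped independently while the base model stays untouched.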

Abstract:
Fréchet Audio Distance (FAD) is the de facto standard for evaluating text‑to‑audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task‑induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log‑scale normalization for fair cross‑encoder comparison. Controlled experiments on six encoders across two datasets reveal a four‑axis trade‑off: reconstruction‑based AudioMAE leads precision sensitivity; ASR‑trained Whisper dominates structural detection but is blind to signal degradation; classification‑trained VGGish maximizes semantic detection but penalizes legitimate intra‑class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation‑native encoders intrinsically aligned with human perception.
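
For reference, the standard definition of FAD makes the encoder dependence explicit: it is the Fréchet distance between two Gaussians fitted to encoder embeddings. The symbols below are notation introduced here, with (μ_r, Σ_r) the mean and covariance of the reference-set embeddings and (μ_g, Σ_g) those of the generated set.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Every term is computed in the encoder's embedding space, so any acoustic property the encoder discards cannot affect the score.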

Abstract:
Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low‑level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text‑to‑audio generation model achieves a Frechet Distance of 12.823 and a Frechet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, it also serves as a promising attempt towards unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.

Abstract:
Diffusion models have achieved remarkable progress in high‑fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition‑based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data‑partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves 2.31× and 2.07× latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U‑Net‑based diffusion models and DiT‑based flow‑matching architectures. Our approach also outperforms existing methods in acceleration under high‑resolution synthesis settings. Code is available at https://github.com/kaist‑dmlab/Hybridiff.
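
To illustrate why the conditional and unconditional paths are a natural unit of partitioning, the sketch below runs the two denoiser evaluations of a standard classifier-free guidance step concurrently and then combines them. This is not the paper's framework: `toy_denoiser` is a hypothetical stand-in for a real network, and two worker threads stand in for two GPUs.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def toy_denoiser(x, t, cond):
    # Hypothetical stand-in for a noise-prediction network.
    shift = 0.0 if cond is None else float(cond)
    return 0.1 * t * x + shift

def guided_noise(denoiser, x, t, cond, scale):
    """One classifier-free guidance step with the two paths run in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two evaluations share no intermediate state, so each could
        # be placed on its own device.
        cond_future = pool.submit(denoiser, x, t, cond)
        uncond_future = pool.submit(denoiser, x, t, None)
        eps_cond, eps_uncond = cond_future.result(), uncond_future.result()
    # Standard guidance combination of the two predictions.
    return eps_uncond + scale * (eps_cond - eps_uncond)

latent = np.ones(3)
guided = guided_noise(toy_denoiser, latent, t=1.0, cond=2.0, scale=7.5)
```

The only synchronization point is the final combination, which is why the split adds little communication per step in this simplified picture.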

Authors: OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang, Chushu Zhou, Hanfu Chen, Hongnan Ma, Jiaxi Li, Jingqi Tong, Junxi Liu, Ke Chen, Shimin Li, Shiqi Jiang, Songlin Wang, Wei Jiang, Zhaoye Fei, Zhiyuan Ning, Chunguo Li, Chenhui Li, Ziwei He, Zengfeng Huang, Xie Chen, Xipeng Qiu

Abstract:
Audio is indispensable for real‑world video, yet generation models have largely overlooked audio components. Current approaches to producing audio‑visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed‑source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open‑source model capable of generating high‑quality, synchronized audio‑visual content, including realistic lip‑synced speech, environment‑aware sound effects, and content‑aligned music. MOVA employs a Mixture‑of‑Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image‑Text to Video‑Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine‑tuning, and prompt enhancement.

Abstract:
Modern voice cloning, also known as zero‑shot text‑to‑speech (TTS), can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practice, these systems often face noisy reference audio, imperfect text prompts, multilingual and long‑form generation, post‑processing, and adversarial perturbations, all of which can weaken robustness. Despite rapid progress in codec‑token language models and diffusion‑based TTS, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive dataset and benchmark for evaluating robustness in voice cloning. RVCBench provides task‑aligned tests covering controlled text‑audio pairing, multilingual and long‑form scenarios, expressive prompts, post‑processing conditions, and passive or proactive audio perturbations. Across 18 robustness evaluations, 225 speakers, and 14,370 utterances, RVCBench supports unified evaluation of input sensitivity, generation stability, output resilience, perturbation robustness, speaker similarity, and deepfake detectability. We evaluate 18 representative open‑source voice cloning models and reveal systematic vulnerabilities in content consistency, speaker similarity, long‑form stability, post‑processing resilience, adversarial robustness, and detector‑facing separability. We release the code and dataset to support reproducible evaluation and future research on robust voice cloning, speech synthesis, and audio generation. Code: https://github.com/Nanboy‑Ronan/RVCBench. Dataset: https://huggingface.co/datasets/Nanboy/RVCBench.

Abstract:
We introduce Mix2Morph, a text‑to‑audio diffusion model fine‑tuned to perform sound morphing without a dedicated dataset of morphs. By fine‑tuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integrate qualities of both sources. We specifically target sound infusions, a practically and perceptually motivated subclass of morphing in which one sound acts as the dominant primary source, providing overall temporal and structural behavior, while a secondary sound is infused throughout, enriching its timbral and textural qualities. Objective evaluations and listening tests show that Mix2Morph outperforms prior baselines and produces high‑quality sound infusions across diverse categories, representing a step toward more controllable and concept‑driven tools for sound design. Sound examples are available at https://anniejchu.github.io/mix2morph.

Abstract:
While video‑to‑audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual‑to‑spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large‑scale video‑binaural audio dataset designed to support spatially aware video‑to‑audio generation; and we propose an end‑to‑end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual‑guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state‑of‑the‑art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. The demo page can be accessed at https://github.com/renlinjie868‑web/SpatialV2A.

Abstract:
Recent advances in generative models have enabled modern Text‑to‑Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods primarily rely on subjective human listening tests, which are time‑consuming. To solve this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross‑Attention (SeqCoAttn). Our model achieves the first rank in the XACLE Challenge, with an SRCC of 0.6402 (an improvement of 30.6% over the challenge baseline) on the test dataset. Code is available at: https://github.com/S‑Orion/MOESCORE.
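
The SRCC reported above is Spearman's rank correlation coefficient between predicted scores and human ratings. As a minimal sketch (NumPy only, with tied values given their average rank), it can be computed as the Pearson correlation of the two rank vectors; the function names and sample values below are illustrative and unrelated to the challenge data.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def srcc(predicted, human):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rank_p, rank_h = average_ranks(predicted), average_ranks(human)
    return float(np.corrcoef(rank_p, rank_h)[0, 1])

# Illustrative values only: two adjacent pairs are swapped in the ordering.
example = srcc([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Since only the ordering matters, an evaluator is rewarded for ranking clips the way listeners do, regardless of the absolute scale of its scores.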

Abstract:
The development of audio foundation models has accelerated rapidly since the emergence of GPT‑4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross‑model comparison; (2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess models' performance on Chinese. To address the first issue, we introduce UltraEval‑Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and generation tasks. UltraEval‑Audio features a modular architecture, supporting 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. To enhance research efficiency, the framework provides a one‑command evaluation feature, accompanied by real‑time public leaderboards. For the second challenge, UltraEval‑Audio adopts a novel comprehensive evaluation scheme for audio codecs, evaluating performance across three key dimensions: semantic accuracy, timbre fidelity, and acoustic quality. To address the third issue, we propose two new Chinese benchmarks, SpeechCMMLU and SpeechHSK, designed to assess Chinese knowledge proficiency and language fluency. We hope that UltraEval‑Audio will provide both academia and industry with a transparent, efficient, and fair platform for comparison of audio models. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval‑Audio.

Abstract:
Text‑to‑audio‑video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio‑video temporal synchronization, while largely overlooking explicit evaluation of audio‑physics grounding, thereby limiting the study of physically plausible audio‑visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio‑physics grounding capabilities of T2AV, image‑to‑audio‑video (I2AV), and video‑to‑audio (V2A) models. PhyAVBench offers PhyAV‑Sound‑11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired‑prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio‑physics dimensions and 41 fine‑grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio‑Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real‑world counterparts. We conduct a comprehensive evaluation of 17 state‑of‑the‑art models. Our results reveal that even leading commercial models struggle with fundamental audio‑physical phenomena, exposing a critical gap beyond audio‑visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio‑visual generation. Prompts, ground‑truth, and generated video samples are available at https://github.com/imxtx/PhyAVBench.

Abstract:
In this paper, we present JoVA, a unified framework for joint video‑audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video‑audio generation typically rely on explicit fusion or modality‑specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self‑attention across video and audio tokens within each transformer layer, enabling direct and efficient cross‑modal interaction without the need for additional alignment modules. Furthermore, to enable high‑quality lip‑speech synchronization, we introduce a simple yet effective mouth‑area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio‑driven state‑of‑the‑art methods in lip‑sync accuracy, speech quality, and overall video‑audio generation fidelity. Our results establish JoVA as an elegant framework for high‑quality multimodal generation.
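
The abstract does not give the form of the mouth-area loss, so the following is only one plausible reading, stated as an assumption: build a box mask around detected mouth keypoints and add an extra reconstruction term averaged over that region. The function names, the box-shaped mask, the margin, and the weighting are all hypothetical.

```python
import numpy as np

def mouth_mask(mouth_keypoints, height, width, margin=1):
    """Binary mask covering the bounding box of (x, y) mouth keypoints."""
    xs, ys = mouth_keypoints[:, 0], mouth_keypoints[:, 1]
    x0 = max(int(xs.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin + 1, width)
    y0 = max(int(ys.min()) - margin, 0)
    y1 = min(int(ys.max()) + margin + 1, height)
    mask = np.zeros((height, width))
    mask[y0:y1, x0:x1] = 1.0
    return mask

def loss_with_mouth_term(pred, target, mask, mouth_weight=1.0):
    """Mean squared error plus an extra term averaged over the mouth box."""
    sq_err = (pred - target) ** 2
    base = sq_err.mean()
    mouth = (sq_err * mask).sum() / max(mask.sum(), 1.0)
    return float(base + mouth_weight * mouth)

keypoints = np.array([[3, 5], [5, 5], [4, 6]])  # hypothetical (x, y) detections
mask = mouth_mask(keypoints, height=8, width=8)
pred = np.zeros((8, 8))
inside = np.zeros((8, 8)); inside[5, 4] = 1.0    # error inside the mouth box
outside = np.zeros((8, 8)); outside[0, 0] = 1.0  # same error elsewhere
loss_inside = loss_with_mouth_term(pred, inside, mask)
loss_outside = loss_with_mouth_term(pred, outside, mask)
```

An identical pixel error costs more when it falls inside the mouth region, which is the kind of extra supervision the abstract describes, added without any new module in the network.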

Abstract:
We introduce a novel pipeline for joint audio‑visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state‑of‑the‑art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video‑to‑audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio‑visual alignment and content integrity.

Abstract:
Despite progress in video‑to‑audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two‑stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio‑temporal inconsistencies. To address this limitation, we introduce the task of end‑to‑end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video‑binaural audio pairs spanning diverse real‑world scenes and camera rotation trajectories, constructed through a semi‑automated pipeline. Furthermore, we propose ViSAudio, an end‑to‑end framework that employs conditional flow matching with a dual‑branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio‑temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state‑of‑the‑art methods across both objective metrics and subjective evaluations, generating high‑quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound‑source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio‑project.

Abstract:
Recent audio‑video generative systems suggest that coupling modalities benefits not only audio‑video synchrony but also the video modality itself. We pose a fundamental question: Does audio‑video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter‑efficient Audio‑Video Full DiT (AVFullDiT) architecture that leverages pre‑trained text‑to‑video (T2V) and text‑to‑audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V‑only counterpart under identical settings. Our results provide the first systematic evidence that audio‑video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision × impact sound), which in turn regularizes video dynamics. Our findings suggest that cross‑modal co‑training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

Abstract:
Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok‑Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok‑Audio extracts continuous features of conditions to generate discrete tokens of target audio in an autoregressive manner; 2) a special task identifier token unifies different learning patterns of multiple tasks in a single framework; 3) a dual‑stream audio codec involving acoustic and semantic branches is developed for high‑fidelity waveform reconstruction. Experimental results demonstrate that UniTok‑Audio achieves competitive performance in comparison with state‑of‑the‑art task‑specific or multi‑task systems across five time‑aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language‑queried audio source separation. To foster future research, we will open‑source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified‑audio.

Abstract:
Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi‑modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long‑range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target‑Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi‑modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single‑condition and multi‑condition constrained generation tasks demonstrate that TGBFN achieves state‑of‑the‑art performance in generating high‑fidelity, condition‑aware CAD sequences. The code is available at https://github.com/scu‑zwh/TGBFN.

Abstract:
We present MGAudio, a novel flow‑based framework for open‑domain video‑to‑audio generation, which introduces model‑guided dual‑role alignment as a central design principle. Unlike prior approaches that rely on classifier‑based or classifier‑free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video‑conditioned audio generation. The framework integrates three main components: (1) a scalable flow‑based Transformer model, (2) a dual‑role alignment mechanism where the audio‑visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model‑guided objective that enhances cross‑modal coherence and audio realism. MGAudio achieves state‑of‑the‑art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier‑free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV‑100 benchmark. These results highlight model‑guided dual‑role alignment as a powerful and scalable paradigm for conditional video‑to‑audio generation. Code is available at: https://github.com/pantheon5100/mgaudio

Abstract:
Foley Control is a lightweight approach to video‑guided Foley that keeps pretrained single‑modality models frozen and learns only a small cross‑attention bridge between them. We connect V‑JEPA2 video embeddings to a frozen Stable Audio Open DiT text‑to‑audio (T2A) model by inserting compact video cross‑attention after the model's existing text cross‑attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio‑video dependency needed for synchronization ‑‑ without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video‑audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi‑modal systems, while preserving prompt‑driven controllability and production‑friendly modularity (swap/upgrade encoders or the T2A backbone without end‑to‑end retraining). Although we focus on Video‑to‑Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
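
As a rough illustration of the bridge idea, the sketch below pools video tokens over time and lets audio tokens attend to the pooled sequence through a single-head cross-attention with a residual connection. All shapes, the average pooling, and the random projection matrices are assumptions made for the example; the abstract does not specify these details.

```python
import numpy as np

def pool_tokens(tokens, factor):
    """Average-pool a (time, dim) token sequence over time by `factor`."""
    usable = (tokens.shape[0] // factor) * factor
    return tokens[:usable].reshape(-1, factor, tokens.shape[1]).mean(axis=1)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
audio_tokens = rng.normal(size=(16, 8))   # hypothetical audio hidden states
video_tokens = rng.normal(size=(12, 6))   # hypothetical video embeddings
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(6, 4))
w_v = rng.normal(size=(6, 8))

pooled = pool_tokens(video_tokens, factor=4)  # 12 video tokens become 3
bridged = audio_tokens + cross_attention(audio_tokens, pooled, w_q, w_k, w_v)
```

In this picture only `w_q`, `w_k`, and `w_v` would be trained, and pooling shrinks the attention matrix from 16×12 to 16×3, which is the memory saving the abstract alludes to.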

Abstract:
Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE‑Audio, a unified speech and music generation model within a novel Dynamic‑Capacity Mixture‑of‑Experts (MoE) framework. Architecturally, UniMoE‑Audio introduces a Top‑P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain‑specific knowledge, shared experts for domain‑agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three‑stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain‑specific knowledge into each "proto‑expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE‑Audio architecture, warming up the gate module and shared expert using a subset of balanced dataset; and 3) Synergistic Joint Training trains the entire model end‑to‑end on the fully balanced dataset, fostering enhanced cross‑domain synergy. Extensive experiments show that UniMoE‑Audio not only achieves state‑of‑the‑art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architecture and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni‑MoE‑site/home.html
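
One common reading of Top-P routing, by analogy with nucleus sampling, is to activate the smallest set of experts whose cumulative router probability reaches a threshold p, so that confident tokens use few experts and ambiguous tokens use more. The sketch below follows that reading as an assumption; it is not taken from the UniMoE-Audio code, and the function name and threshold are illustrative.

```python
import numpy as np

def top_p_experts(router_logits, p=0.7):
    """Select the fewest experts whose cumulative probability reaches p."""
    logits = np.asarray(router_logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    order = np.argsort(-probs)                    # most probable first
    cumulative = np.cumsum(probs[order])
    count = int(np.searchsorted(cumulative, p)) + 1
    count = min(count, len(order))                # guard against round-off
    chosen = order[:count]
    weights = probs[chosen] / probs[chosen].sum()  # renormalize over chosen
    return chosen, weights

confident_experts, _ = top_p_experts([10.0, 0.0, 0.0, 0.0])
uncertain_experts, uncertain_weights = top_p_experts([0.0, 0.0, 0.0, 0.0])
```

With a peaked router distribution a single expert suffices, while a flat distribution over four experts needs three to reach p = 0.7, which is the dynamic allocation behavior the abstract refers to.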

Abstract:
We introduce MMAudioSep, a generative model for video/text‑queried sound separation that is founded on a pretrained video‑to‑audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine‑tuning, the model retains the ability for original video‑to‑audio generation. This highlights the potential of foundational sound generation models to be adopted for sound‑related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.

Abstract:
Text‑to‑audio (TTA) generation with fine‑grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi‑task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine‑grained information, including text, timing, and phoneme features, through a step‑by‑step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large‑scale text‑audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine‑grained information, aligning inherently with the coarse‑to‑fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state‑of‑the‑art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control‑audio.github.io/Control‑Audio.

Abstract:
Sign language to spoken language audio translation is important to connect the hearing‑ and speech‑challenged humans with others. We consider sign language videos with isolated sign sequences rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. Towards this, we propose IsoSignVid2Aud, a novel end‑to‑end framework that translates sign language videos with a sequence of possibly non‑grammatic continuous signs to speech without requiring intermediate text representation, providing immediate communication benefits while avoiding the latency and cascading errors inherent in multi‑stage translation systems. Our approach combines an I3D‑based feature extraction module with a specialized feature transformation network and an audio generation pipeline, utilizing a novel Non‑Maximal Suppression (NMS) algorithm for the temporal detection of signs in non‑grammatic continuous sequences. Experimental results demonstrate competitive performance on ASL‑Citizen‑1500 and WLASL‑100 datasets with Top‑1 accuracies of 72.01% and 78.67%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: https://github.com/BheeshmSharma/IsoSignVid2Aud_AIMLsystems‑2025.
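
For readers unfamiliar with suppression over time, the sketch below shows a standard greedy temporal non-maximum suppression over candidate sign segments. It is the textbook procedure, not the paper's novel variant; the segment boundaries, scores, and IoU threshold are made up for illustration.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Keep the highest-scoring segments, dropping heavily overlapping ones."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in kept):
            kept.append(i)
    # Return surviving indices in temporal order, ready for sign-by-sign decoding.
    return sorted(kept, key=lambda i: segments[i][0])

candidates = [(0, 10), (1, 11), (20, 30), (21, 29)]  # hypothetical frame ranges
confidences = [0.9, 0.8, 0.6, 0.7]
kept_indices = temporal_nms(candidates, confidences)
```

Each surviving segment then corresponds to one detected sign, so near-duplicate detections of the same sign are not translated twice.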

Abstract:
Prevailing Video‑to‑Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame‑level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end‑to‑end causality and targets low per‑frame latency with audio‑visual synchronization. Our model's backbone is a decoder‑only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end‑to‑end causality and efficiency. The model is trained through a diffusion pre‑training followed by consistency fine‑tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high‑quality full‑band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per‑frame waveform‑level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi‑saito‑sony.github.io/soundreactor/.
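
A quick arithmetic check puts the reported latencies in context: at 30 FPS each frame allows 1000/30 ≈ 33.3 ms, and both reported per-frame latencies fall under that budget. The short script below only restates figures from the abstract; the variable names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available per video frame, in milliseconds."""
    return 1000.0 / fps

# Per-frame waveform-level latencies reported in the abstract (ms).
reported_latency_ms = {"NFE=1": 26.3, "NFE=4": 31.5}

budget = frame_budget_ms(30)
headroom_ms = {name: budget - latency for name, latency in reported_latency_ms.items()}

for name, slack in headroom_ms.items():
    print(f"{name}: {slack:.1f} ms of headroom per frame")
```

The margin with NFE=4 is under 2 ms, so the NFE=1 setting leaves noticeably more room for whatever else runs in the pipeline.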

Abstract:
Recently, an increasing number of multimodal (text and audio) benchmarks have emerged, primarily focusing on evaluating models' understanding capability. However, exploration into assessing generative capabilities remains limited, especially for open‑ended long‑form content generation. Significant challenges include the absence of reference standard answers, the lack of unified evaluation metrics, and uncontrollable human judgments. In this work, we take podcast‑like audio generation as a starting point and propose PodEval, a comprehensive and well‑designed open‑source evaluation framework. In this framework: 1) We construct a real‑world podcast dataset spanning diverse topics, serving as a reference for human‑level creative quality. 2) We introduce a multimodal evaluation strategy and decompose the complex task into three dimensions: text, speech and audio, with different evaluation emphasis on "Content" and "Format". 3) For each modality, we design corresponding evaluation methods, involving both objective metrics and subjective listening tests. We leverage representative podcast generation systems (including open‑source, closed‑source, and human‑made) in our experiments. The results offer in‑depth analysis and insights into podcast generation, demonstrating the effectiveness of PodEval in evaluating open‑ended long‑form audio. This project is open‑source to facilitate public use: https://github.com/yujxx/PodEval.

Abstract:
There is a high demand for audio‑visual editing in video post‑production and filmmaking. While numerous models have explored audio and video editing, they struggle with object‑level audio‑visual operations. Specifically, object‑level audio‑visual editing requires the ability to perform object addition, replacement, and removal across both audio and visual modalities, while preserving the structural information of the source instances during the editing process. In this paper, we present Object‑AVEdit, which achieves object‑level audio‑visual editing based on the inversion‑regeneration paradigm. To achieve object‑level controllability during editing, we develop a word‑to‑sounding‑object well‑aligned audio generation model, bridging the gap in object‑controllability between audio and current video generation models. Meanwhile, to achieve better structural information preservation and object‑level editing effects, we propose an inversion‑regeneration holistically‑optimized editing algorithm, ensuring both information retention during the inversion and a better regeneration effect. Extensive experiments demonstrate that our editing model achieves advanced results in both audio and video object‑level editing tasks with fine audio‑visual semantic alignment. In addition, our audio generation model also achieves advanced performance. More results are available on our project page: https://gewu‑lab.github.io/Object_AVEdit‑website/.

Abstract:
Video‑conditioned audio generation, including Video‑to‑Sound (V2S) and Visual Text‑to‑Speech (VisualTTS), has traditionally been treated as distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow‑matching framework that seamlessly solves both problems. To effectively handle multiple input signals within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism leveraging distinct intrinsic properties of attention layers: cross‑attention for semantic conditions, and self‑attention for temporally‑intensive conditions. Besides, contrary to the prevailing belief that joint training for the two tasks leads to performance degradation, we demonstrate that VSSFlow maintains superior performance during the end‑to‑end joint learning process. Furthermore, we use a straightforward feature‑level data synthesis method, demonstrating that our framework provides a robust foundation that easily adapts to joint sound and speech generation using synthetic data. Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state‑of‑the‑art domain‑specific baselines, underscoring the critical potential of unified generative models. Project page: https://vasflow1.github.io/vasflow/

Abstract:
Audio generation, including speech, music and sound effects, has advanced rapidly in recent years. These tasks can be divided into two categories: time‑aligned (TA) tasks, where each input unit corresponds to a specific segment of the output audio (e.g., phonemes aligned with frames in speech synthesis); and non‑time‑aligned (NTA) tasks, where such alignment is not available. Since modeling paradigms for the two types are typically different, research on different audio generation tasks has traditionally followed separate trajectories. However, audio is not inherently divided into such categories, making a unified model a natural and necessary goal for general audio generation. Previous unified audio generation works have adopted autoregressive architectures, while unified non‑autoregressive approaches remain largely unexplored. In this work, we propose UniFlow‑Audio, a universal audio generation framework based on flow matching. We propose a dual‑fusion mechanism that temporally aligns audio latents with TA features and integrates NTA features via cross‑attention in each model block. Task‑balanced data sampling is employed to maintain strong performance across both TA and NTA tasks. UniFlow‑Audio supports omni‑modalities, including text, audio, and video. By leveraging the advantage of multi‑task learning and the generative modeling capabilities of flow matching, UniFlow‑Audio achieves strong results across 7 tasks using fewer than 8K hours of public training data and under 1B trainable parameters. Even the small variant with only ~200M trainable parameters shows competitive performance, highlighting UniFlow‑Audio as a potential non‑auto‑regressive foundation model for audio generation. Code and models will be available at https://wsntxxn.github.io/uniflow_audio.

Abstract:
Current video‑to‑audio (V2A) methods struggle in complex multi‑event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic‑temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi‑event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio‑visual pretraining (AVP) to enhance performance in complex multi‑event scenarios. Our contributions include two key innovations. The first is SlowFast Contrastive AVP (SF‑CAVP), a pioneering AVP model with a unified dual‑stream architecture, which explicitly aligns core semantic representations and rapid dynamic features of audio‑visual data to handle multi‑event complexity. Second, we integrate DPO into the V2A task and propose AVP‑Ranked Preference Optimization (AVP‑RPO), which uses SF‑CAVP as a reward model to quantify and prioritize critical semantic‑temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state‑of‑the‑art (SOTA) performance in multi‑event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. Demos are available at https://v2aresearch.github.io/MultiSoundGen/.

Abstract:
This work presents STAR, the first end‑to‑end speech‑to‑audio generation framework, designed to enhance efficiency and address error propagation inherent in cascaded systems. Unlike prior approaches relying on text or vision, STAR leverages speech as it constitutes a natural modality for interaction. As an initial step to validate the feasibility of the system, we demonstrate through representation learning experiments that spoken sound event semantics can be effectively extracted from raw speech, capturing both auditory events and scene cues. Leveraging the semantic representations, STAR incorporates a bridge network for representation mapping and a two‑stage training strategy to achieve end‑to‑end synthesis. With a 76.9% reduction in speech processing latency, STAR demonstrates superior generation performance over the cascaded systems. Overall, STAR establishes speech as a direct interaction signal for audio generation, thereby bridging representation learning and multimodal synthesis. Generated samples are available at https://zeyuxie29.github.io/STAR.

Abstract:
With the development of large‑scale diffusion‑based and language‑modeling‑based generative models, impressive progress has been achieved in text‑to‑audio generation. Despite producing high‑quality outputs, existing text‑to‑audio models mainly aim to generate semantically aligned sound and fall short of controlling fine‑grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text‑to‑audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user‑provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAudio generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text‑to‑audio tasks. We also provide a human‑involved dataset containing audio events from real‑world CTTA cases as the benchmark for customized generation tasks.

Abstract:
Text‑to‑Audio (TTA) generation has made rapid progress, but current evaluation methods remain narrow, focusing mainly on perceptual quality while overlooking robustness, generalization, and ethical concerns. We present TTA‑Bench, a comprehensive benchmark for evaluating TTA models across functional performance, reliability, and social responsibility. It covers seven dimensions including accuracy, robustness, fairness, and toxicity, and includes 2,999 diverse prompts generated through automated and manual methods. We introduce a unified evaluation protocol that combines objective metrics with over 118,000 human annotations from both experts and general users. Ten state‑of‑the‑art models are benchmarked under this framework, offering detailed insights into their strengths and limitations. TTA‑Bench establishes a new standard for holistic and responsible evaluation of TTA systems. The dataset and evaluation tools are open‑sourced at https://nku‑hlt.github.io/tta‑bench/.

Abstract:
While recent work in controllable text‑to‑audio (TTA) generation has achieved fine‑grained control through timestamp conditioning, its scope remains limited by audio quality and input format. These models often suffer from poor audio quality in real datasets due to sole reliance on synthetic data. Moreover, some models are constrained to a closed vocabulary of sound events, preventing them from controlling audio generation for open‑ended, free‑text queries. This paper introduces PicoAudio2, a framework that advances temporal‑controllable TTA by mitigating these data and architectural limitations. Specifically, we use a grounding model to annotate event timestamps of real audio‑text datasets to curate temporally‑strong real data, in addition to simulation data from existing works. The model is trained on the combination of real and simulation data. Moreover, we propose an enhanced architecture that integrates the fine‑grained information from a timestamp matrix with coarse‑grained free‑text input. Experiments show that PicoAudio2 exhibits superior performance in terms of temporal controllability and audio quality.
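
To make the idea of a timestamp matrix concrete, the following is a minimal sketch, assuming a binary event-by-frame layout. The function name `timestamp_matrix`, the 0.1 s frame hop, and the example events are illustrative assumptions; the abstract does not specify PicoAudio2's actual representation.

```python
def timestamp_matrix(events, duration, hop=0.1):
    """Build a binary (event x frame) matrix from event timestamps.

    events: list of (label, onset_sec, offset_sec); events may overlap.
    A frame is marked 1 when its index lies in [onset_frame, offset_frame).
    The hop size is an assumed value chosen for illustration.
    """
    n_frames = round(duration / hop)
    labels = [label for label, _, _ in events]
    matrix = []
    for _, onset, offset in events:
        # Convert seconds to integer frame indices to avoid float comparisons.
        start, end = round(onset / hop), round(offset / hop)
        matrix.append([1 if start <= i < end else 0 for i in range(n_frames)])
    return labels, matrix


labels, m = timestamp_matrix(
    [("dog barking", 0.0, 0.3), ("door slam", 0.2, 0.5)], duration=1.0
)
```

Each row can then be paired with the free-text description of its event, so a coarse caption and a fine-grained timing signal describe the same clip.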

Abstract:
Recent advances in text‑to‑audio (TTA) generation excel at synthesizing short audio clips but struggle with long‑form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long‑form audio narratives. AudioStory possesses strong instruction‑following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub‑tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM‑diffuser collaboration into two specialized components, i.e., a bridging query for intra‑event semantic alignment and a residual query for cross‑event coherence preservation. (2) End‑to‑end training: By unifying instruction comprehension and audio generation within a single end‑to‑end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory‑10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single‑audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction‑following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory

Abstract:
Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video‑to‑audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo‑Foley, an end‑to‑end text‑video‑to‑audio framework that synthesizes high‑fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k‑hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self‑supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual‑stream audio‑video fusion through joint attention, and textual semantic injection via cross‑attention. Comprehensive evaluations demonstrate that HunyuanVideo‑Foley achieves new state‑of‑the‑art performance across audio fidelity, visual‑semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo‑foley/.

Abstract:
Generating high‑quality and temporally synchronized audio from video content is essential for video editing and post‑production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short‑form audio generation for video segments under 10 seconds or rely on noisy datasets for long‑form video‑to‑audio synthesis. To address these limitations, we introduce LD‑LAudio‑V1, an extension of state‑of‑the‑art video‑to‑audio models that incorporates dual lightweight adapters to enable long‑form audio generation. In addition, we release a clean and human‑annotated video‑to‑audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine‑tuning with short training videos, LD‑LAudio‑V1 achieves significant improvements across multiple metrics: $FD_{\text{passt}}$ 450.00 $\rightarrow$ 327.29 (+27.27%), $FD_{\text{panns}}$ 34.88 $\rightarrow$ 22.68 (+34.98%), $FD_{\text{vgg}}$ 3.75 $\rightarrow$ 1.28 (+65.87%), $KL_{\text{panns}}$ 2.49 $\rightarrow$ 2.07 (+16.87%), $KL_{\text{passt}}$ 1.78 $\rightarrow$ 1.53 (+14.04%), $IS_{\text{panns}}$ 4.17 $\rightarrow$ 4.30 (+3.12%), $IB_{\text{score}}$ 0.25 $\rightarrow$ 0.28 (+12.00%), Energy$\Delta10\text{ms}$ 0.3013 $\rightarrow$ 0.1349 (+55.23%), Energy$\Delta10\text{ms}$ (vs. GT) 0.0531 $\rightarrow$ 0.0288 (+45.76%), and Sem. Rel. 2.73 $\rightarrow$ 3.28 (+20.15%). Our dataset aims to facilitate further research in long‑form video‑to‑audio generation and is available at https://github.com/deepreasonings/long‑form‑video2audio.
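
The percentages in parentheses read as relative improvements over the baseline value. A minimal sketch of that arithmetic follows, assuming the gain is measured relative to the "before" value and that FD, KL, and the energy deltas are lower-is-better while IS, IB score, and Sem. Rel. are higher-is-better (an assumption that is consistent with the reported figures). The helper name is hypothetical.

```python
def relative_improvement(before, after, lower_is_better=True):
    """Percentage improvement of `after` over `before`, relative to `before`."""
    change = (before - after) if lower_is_better else (after - before)
    return 100.0 * change / before


# Values taken from the abstract above.
fd_passt = round(relative_improvement(450.00, 327.29), 2)
fd_vgg = round(relative_improvement(3.75, 1.28), 2)
is_panns = round(relative_improvement(4.17, 4.30, lower_is_better=False), 2)
sem_rel = round(relative_improvement(2.73, 3.28, lower_is_better=False), 2)
```

Under these assumptions the helper gives 27.27, 65.87, 3.12, and 20.15 for the four metrics shown, matching the abstract.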

Abstract:
Diffusion and flow‑matching models have revolutionized automatic text‑to‑audio generation in recent times. These models are increasingly capable of generating high‑quality and faithful audio outputs capturing speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics‑to‑song models, such as DiffRhythm, ACE‑Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine‑grained word‑level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow‑matching‑based JAM is the first effort toward endowing song generation with word‑level timing and duration control, allowing fine‑grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics‑to‑song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of the music‑specific attributes.
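
For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard pairwise DPO objective as originally formulated for models with tractable log-likelihoods. The abstract does not say how JAM adapts DPO to flow matching, so this is only the generic form; the function name, argument names, and the value of `beta` are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic pairwise DPO loss for one (chosen, rejected) sample pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    beta       : strength of the implicit KL regularisation (assumed value)
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


no_preference = dpo_loss(0.0, 0.0, 0.0, 0.0)        # equals log(2)
prefers_chosen = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # smaller than log(2)
```

The loss falls below log(2) once the trained model favours the chosen sample more strongly than the reference model does.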

Abstract:
Text‑to‑audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally‑aligned audio‑text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s‑5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing‑conditioned 10‑second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training‑free timing‑controlled T2A framework, FreeAudio, making the first attempt to enable timing‑controlled long‑form T2A generation, e.g., "owl hooted at 2.4s‑5.2s and crickets chirping at 0s‑24s". Specifically, we first employ an LLM to plan non‑overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state‑of‑the‑art timing‑conditioned T2A synthesis quality among training‑free methods and is comparable to leading training‑based methods; 2) FreeAudio demonstrates comparable long‑form generation quality with training‑based Stable Audio and paves the way for timing‑controlled long‑form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/
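
The planning step can be pictured with a small deterministic stand-in. FreeAudio uses an LLM to plan the windows and to recaption each one; the sketch below only illustrates how overlapping timing prompts can be cut into non-overlapping windows, and the function name and data layout are assumptions rather than the authors' implementation.

```python
def plan_windows(events):
    """Split overlapping timed events into non-overlapping time windows.

    events: list of (description, start_sec, end_sec).
    Returns a list of (start, end, active_descriptions), skipping silent gaps.
    """
    # Every onset or offset is a potential window boundary.
    boundaries = sorted({t for _, start, end in events for t in (start, end)})
    windows = []
    for start, end in zip(boundaries, boundaries[1:]):
        active = [d for d, s, e in events if s <= start and end <= e]
        if active:
            windows.append((start, end, active))
    return windows


windows = plan_windows([("owl hooted", 2.4, 5.2), ("crickets chirping", 0.0, 24.0)])
```

For the example prompt in the abstract this yields three windows: crickets alone from 0 s to 2.4 s, owl plus crickets from 2.4 s to 5.2 s, and crickets alone from 5.2 s to 24 s. In the described framework each window would then be recaptioned in natural language.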

Abstract:
Video‑to‑audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods relying on global video information struggle with complex scenes and generating audio tailored to specific objects. To address these limitations, we introduce Hear‑Your‑Click, an interactive V2A framework enabling users to generate sounds for specific objects by clicking on the frame. To achieve this, we propose Object‑aware Contrastive Audio‑Visual Fine‑tuning (OCAV) with a Mask‑guided Visual Encoder (MVE) to obtain object‑level visual features aligned with audio. Furthermore, we tailor two data augmentation strategies, Random Video Stitching (RVS) and Mask‑guided Loudness Modulation (MLM), to enhance the model's sensitivity to segmented objects. To measure audio‑visual correspondence, we designed a new evaluation metric, the CAV score. Extensive experiments demonstrate that our framework offers more precise control and improves generation performance across various metrics. Project Page: https://github.com/SynapGrid/Hear‑Your‑Click

Abstract:
Recent advancements in 4D generation have demonstrated its remarkable capability in synthesizing photorealistic renderings of dynamic 3D scenes. However, despite achieving impressive visual performance, almost all existing methods overlook the generation of spatial audio aligned with the corresponding 4D scenes, posing a significant limitation to truly immersive audiovisual experiences. To mitigate this issue, we propose Sonic4D, a novel framework that enables spatial audio generation for immersive exploration of 4D scenes. Specifically, our method is composed of three stages: 1) To capture both the dynamic visual content and raw auditory information from a monocular video, we first employ pre‑trained expert models to generate the 4D scene and its corresponding monaural audio. 2) Subsequently, to transform the monaural audio into spatial audio, we localize and track the sound sources within the 4D scene, where their 3D spatial coordinates at different timestamps are estimated via a pixel‑level visual grounding strategy. 3) Based on the estimated sound source locations, we further synthesize plausible spatial audio that varies across different viewpoints and timestamps using physics‑based simulation. Extensive experiments have demonstrated that our proposed method generates realistic spatial audio consistent with the synthesized 4D scene in a training‑free manner, significantly enhancing the immersive experience for users. Generated audio and video examples are available at https://x‑drunker.github.io/Sonic4D‑project‑page.

Abstract:
Spatial audio is essential for enhancing the immersiveness of audio‑visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address a novel problem of generating first‑order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we introduce YT‑Ambigen, a dataset comprising 102K 5‑second YouTube video clips paired with corresponding first‑order ambisonics. We also propose new evaluation metrics to assess the spatial aspect of generated audio based on audio energy maps and saliency metrics. Furthermore, we present Video‑to‑Spatial Audio Generation (ViSAGe), an end‑to‑end framework that generates first‑order ambisonics from silent video frames by leveraging CLIP visual features and autoregressive neural audio codec modeling with both directional and visual guidance. Experimental results demonstrate that ViSAGe produces plausible and coherent first‑order ambisonics, outperforming two‑stage approaches consisting of video‑to‑audio generation and audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned high‑quality spatial audio that adapts to viewpoint changes.

Abstract:
This paper presents BemaGANv2, an advanced GAN‑based vocoder designed for high‑fidelity and long‑term audio generation, with a focus on systematic evaluation of discriminator combination strategies. Long‑term audio generation is critical for applications in Text‑to‑Music (TTM) and Text‑to‑Audio (TTA) systems, where maintaining temporal coherence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti‑aliased Multi‑Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi‑Envelope Discriminator (MED), a novel architecture we proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi‑Resolution Discriminator (MRD), this combination enables more accurate modeling of long‑range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi‑Scale Discriminator (MSD) + MED, MSD + MRD, and Multi‑Period Discriminator (MPD) + MED + MRD, using objective metrics (Fréchet Audio Distance (FAD), Structural Similarity Index (SSIM), Pearson Correlation Coefficient (PCC), Mel‑Cepstral Distortion (MCD), Multi‑Resolution STFT (M‑STFT), Periodicity error (Periodicity)) and subjective evaluations (MOS, SMOS). To support reproducibility, we provide detailed architectural descriptions, training configurations, and complete implementation details. The code, pre‑trained models, and audio demo samples are available at: https://github.com/dinhoitt/BemaGANv2.
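
The abstract does not spell out the Snake activation, so the sketch below uses its commonly cited definition, snake(x) = x + sin²(αx)/α. Treating `alpha` as a fixed scalar is a simplification for illustration; in neural vocoders it is typically a learned parameter, and nothing here is taken from the BemaGANv2 code.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: identity plus a periodic term, x + sin^2(alpha * x) / alpha.

    alpha controls the frequency of the periodic component; it is a fixed
    scalar here, which is an assumption made to keep the example small.
    """
    return x + (math.sin(alpha * x) ** 2) / alpha


at_zero = snake(0.0)                  # no offset at the origin
at_quarter_period = snake(math.pi / 2)  # pi/2 + 1 for alpha = 1
```

The linear term keeps the function monotonic, while the sine-squared term adds a periodic bias that suits periodic waveforms.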

Abstract:
Diffusion models have emerged as powerful deep generative techniques, producing high‑quality and diverse samples in applications across various domains, including audio. While existing reviews provide overviews, there remains limited in‑depth discussion of specific design choices. The audio diffusion model literature also lacks principled guidance for the implementation of these design choices and their comparisons for different applications. This survey provides a comprehensive review of diffusion model design with an emphasis on design principles for quality improvement and conditioning for audio applications. We adopt the score modeling perspective as a unifying framework that accommodates various interpretations, including recent approaches like flow matching. We systematically examine the training and sampling procedures of diffusion models, and audio applications through different conditioning mechanisms. To provide an integrated, unified codebase and to promote reproducible research and rapid prototyping, we introduce an open‑source codebase (https://github.com/gzhu06/AudioDiffuser) that implements our reviewed framework for various audio applications. We demonstrate its capabilities through three case studies: audio generation, speech enhancement, and text‑to‑speech synthesis, with benchmark evaluations on standard datasets.

Abstract:
As deep learning advances in audio generation, challenges in audio security and copyright protection highlight the need for robust audio watermarking. Recent neural network‑based methods have made progress but still face three main issues: preventing unauthorized access, decoding initial watermarks after multiple embeddings, and embedding varying lengths of watermarks. To address these issues, we propose WAKE, the first key‑controllable audio watermark framework. WAKE embeds watermarks using specific keys and recovers them with corresponding keys, enhancing security by making incorrect key decoding impossible. It also resolves the overwriting issue by allowing watermark decoding after multiple embeddings and supports variable‑length watermark insertion. WAKE outperforms existing models in both watermarked audio quality and watermark detection accuracy. Code, more results, and demo page: https://thuhcsi.github.io/WAKE.

Abstract:
Language‑queried Audio Source Separation (LASS) enables open‑vocabulary sound separation via natural language queries. While existing methods rely on task‑specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training‑free framework leveraging generative priors for zero‑shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality‑specific challenges. To address these issues, we propose Diffusion‑Guided Mask Optimization (DGMO), a test‑time optimization framework that refines spectrogram masks for precise, input‑aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task‑specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero‑shot audio separation. The code is available at: https://wltschmrz.github.io/DGMO/

Abstract:
Text‑to‑audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion‑based frameworks, including the Tango and the AudioLDM series, represent the state‑of‑the‑art in text‑to‑audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask‑based model operating on discrete tokens, addresses slow inference through iterative mask‑based parallel decoding. However, its audio quality still lags behind that of diffusion‑based models. In this work, we introduce IMPACT, a text‑to‑audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask‑based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state‑of‑the‑art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models. The project website is available at https://audio‑impact.github.io/.

Abstract:
Recent advances in audio generation have led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in‑domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD‑Bench, a large‑scale cross‑domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross‑domain evaluation setup, where audio deepfake detectors can be tested "in the wild". Our in‑domain and cross‑domain experiments indicate a clear disparity between the in‑domain performance of deepfake detectors, which is usually as high as 100%, and the cross‑domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad‑bench/.

Abstract:
Classifier‑Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade‑off between quality and diversity. To address this issue, we make two key contributions. First, CFG generally does not correspond to a well‑defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero, where the data distribution is tilted by a power $w > 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low‑noise limit. Second, motivated by this insight, we propose a Gibbs‑like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text‑to‑audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at https://github.com/yazidjanati/cfgig
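
The linear combination that defines standard CFG can be written in a few lines. This is a minimal sketch of the usual guidance rule only, not of the Gibbs-like sampler proposed in the paper; the function name and the use of plain Python lists in place of tensors are assumptions made to keep the example self-contained.

```python
def cfg_combine(cond, uncond, w):
    """Standard classifier-free guidance: uncond + w * (cond - uncond).

    cond, uncond: denoiser outputs with and without the prompt (same length).
    w = 1 recovers the conditional output; w > 1 extrapolates away from the
    unconditional output, which is the regime discussed in the abstract.
    """
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


conditional_only = cfg_combine([1.0, 2.0], [0.0, 1.0], w=1.0)
guided = cfg_combine([1.0, 2.0], [0.0, 1.0], w=3.0)
```

With `w=3.0` the output moves three times as far from the unconditional prediction as the conditional one does, which illustrates how larger guidance weights sharpen prompt alignment at the cost of diversity.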

Abstract:
We present Sat2Sound, a unified multimodal framework for geospatial soundscape understanding, designed to predict and map the distribution of sounds across the Earth's surface. Existing methods for this task rely on paired satellite images and geotagged audio samples, which often fail to capture the full diversity of sound at a location. Sat2Sound overcomes this limitation by augmenting datasets with semantically rich, vision‑language model‑generated soundscape descriptions, which broaden the range of possible ambient sounds represented at each location. Our framework jointly learns from audio, text descriptions of audio, satellite images, and synthetic image captions through contrastive and codebook‑aligned learning, discovering a set of "soundscape concepts" shared across modalities, enabling hyper‑localized, explainable soundscape mapping. Sat2Sound achieves state‑of‑the‑art performance in cross‑modal retrieval between satellite image and audio on the GeoSound and SoundingEarth benchmarks. Finally, by retrieving detailed soundscape captions that can be rendered through text‑to‑audio models, Sat2Sound enables location‑conditioned soundscape synthesis for immersive and educational applications, even with limited computational resources. Our code and models are available at https://github.com/mvrl/sat2sound.

Abstract:
Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero‑shot classification, audio retrieval, audio captioning, and text‑conditioned audio generation. Existing contrastive language‑audio pretrained models are typically trained using global, clip‑level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP‑like language‑audio models, particularly if they are expected to produce frame‑level embeddings, can benefit from stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single‑sentence free‑text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non‑audible events, transcribed speech, typos, and annotator language bias. We further propose a frame‑wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording and demonstrate that our model has better temporal text‑audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark. The dataset and our source code are available on Zenodo and GitHub, respectively.

Abstract:
Traditional video‑to‑audio generation techniques primarily focus on perspective video and non‑spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360‑degree videos, specifically producing First‑order Ambisonics (FOA) audio ‑ a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task that is curated from real‑world data. We also design an efficient semi‑automated pipeline for collecting and cleaning paired video‑audio data. To generate spatial audio from 360‑degree video, we propose a novel framework OmniAudio, which leverages self‑supervised pre‑training using both spatial audio data (in FOA format) and large‑scale non‑spatial data. Furthermore, OmniAudio features a dual‑branch framework that utilizes both panoramic and perspective video inputs to capture comprehensive local and global information from 360‑degree videos. Experimental results demonstrate that OmniAudio achieves state‑of‑the‑art performance across both objective and subjective metrics on Sphere360. Code and datasets are available at https://github.com/liuhuadai/OmniAudio. The project website is available at https://OmniAudio‑360V2SA.github.io.
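
As an illustration of the FOA format mentioned above, the following minimal sketch encodes a mono source at a given direction into four first-order Ambisonics channels. It assumes the AmbiX convention (channel order W, Y, Z, X with SN3D normalization, azimuth measured counter-clockwise from the front); the abstract does not state which convention OmniAudio uses, and the function name `encode_foa` is hypothetical.

```python
import math

def encode_foa(samples, azimuth_deg, elevation_deg):
    """Encode a mono signal into FOA channels (assumed AmbiX: W, Y, Z, X, SN3D)."""
    az = math.radians(azimuth_deg)    # 0 = front, positive = towards the left
    el = math.radians(elevation_deg)  # 0 = horizontal plane, positive = up
    gains = {
        "W": 1.0,                          # omnidirectional component
        "Y": math.sin(az) * math.cos(el),  # left-right component
        "Z": math.sin(el),                 # up-down component
        "X": math.cos(az) * math.cos(el),  # front-back component
    }
    # Each channel is the mono signal scaled by its direction-dependent gain.
    return {ch: [g * s for s in samples] for ch, g in gains.items()}
```

For example, a source directly to the left (azimuth 90 degrees, elevation 0) contributes only to W and Y, which is how the four channels jointly carry directionality.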

Abstract:
In recent years, text‑to‑audio systems have achieved remarkable success, enabling the generation of complete audio segments directly from text descriptions. While these systems also facilitate music creation, the element of human creativity and deliberate expression is often limited. In contrast, the present work allows composers, arrangers, and performers to create the basic building blocks for music creation: audio of individual musical notes for use in electronic instruments and DAWs. Through text prompts, the user can specify the timbre characteristics of the audio. We introduce a system that combines a latent diffusion model and multi‑modal contrastive learning to generate musical timbres conditioned on text descriptions. By jointly generating the magnitude and phase of the spectrogram, our method eliminates the need for subsequently running a phase retrieval algorithm, as related methods do. Audio examples, source code, and a web app are available at https://wxuanyuan.github.io/Musical‑Note‑Generation/

Abstract:
Although significant progress has been made in audio‑driven talking head generation, text‑driven methods remain underexplored. In this work, we present OmniTalker, a unified framework that jointly generates synchronized talking audio‑video content from input text while emulating the speaking and facial movement styles of the target identity, including speech characteristics, head motion, and facial dynamics. Our framework adopts a dual‑branch diffusion transformer (DiT) architecture, with one branch dedicated to audio generation and the other to video synthesis. At the shallow layers, cross‑modal fusion modules are introduced to integrate information between the two modalities. In deeper layers, each modality is processed independently, with the generated audio decoded by a vocoder and the video rendered using a GAN‑based high‑quality visual renderer. Leveraging the in‑context learning capability of DiT through a masked‑infilling strategy, our model can simultaneously capture both audio and visual styles without requiring explicit style extraction modules. Thanks to the efficiency of the DiT backbone and the optimized visual renderer, OmniTalker achieves real‑time inference at 25 FPS. To the best of our knowledge, OmniTalker is the first one‑shot framework capable of jointly modeling speech and facial styles in real time. Extensive experiments demonstrate its superiority over existing methods in terms of generation quality, particularly in preserving style consistency and ensuring precise audio‑video synchronization, all while maintaining efficient inference.

Abstract:
Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, and 2) large‑scale, high‑quality training data. As such, we propose AudioX, a unified framework for anything‑to‑audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals) in this work. The core design in this framework is a Multimodal Adaptive Fusion module, which enables the effective fusion of diverse multimodal inputs, enhancing cross‑modal alignment and improving overall generation quality. To train this unified model, we construct a large‑scale, high‑quality dataset, IF‑caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal‑conditioned audio generation. We benchmark AudioX against state‑of‑the‑art methods across a wide range of tasks, finding that our model achieves superior performance, especially in text‑to‑audio and text‑to‑music generation. These results demonstrate our method is capable of audio generation under multimodal control signals, showing powerful instruction‑following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.

Abstract:
Current audio generation conditioned by text or video focuses on aligning audio with text/video modalities. Despite excellent alignment results, these multimodal frameworks still cannot be directly applied to compelling movie storytelling involving multiple scenes, where "on‑screen" sounds require temporally‑aligned audio generation, while "off‑screen" sounds contribute to appropriate environment sounds accompanied by background music when applicable. Inspired by professional movie production, this paper proposes a multi‑agentic framework for audio generation supervised by an autonomous Sound Director agent, engaging multi‑turn conversations with other agents for on‑screen and off‑screen sound generation through multimodal LLM. To address on‑screen sound generation, after detecting any talking humans in videos, we capture semantically and temporally synchronized sound by training a prediction model that forecasts interpretable, time‑varying audio control signals: loudness, pitch, and timbre, which are used by a Foley Artist agent to condition a cross‑attention module in the sound generation. The Foley Artist works cooperatively with the Composer and Voice Actor agents, and together they autonomously generate off‑screen sound to complement the overall production. Each agent takes on specific roles similar to those of a movie production team. To temporally ground audio language models, in ReelWave, text/video conditions are decomposed into atomic, specific sound generation instructions synchronized with visuals when applicable. Consequently, our framework can generate rich and relevant audio content conditioned on video clips extracted from movies.

Abstract:
Video‑to‑audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text‑to‑audio generative diffusion models, this paper presents how to balance the representation of mel‑spectrograms in terms of completeness and complexity through a new approach called Mel Quantization‑Continuum Decomposition (Mel‑QCD). We decompose the mel‑spectrogram into three distinct types of signals and apply quantization or continuity to each, so that they can be effectively predicted from video by a devised video‑to‑all (V2X) predictor. Then, the predicted signals are recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel‑QCD method demonstrates state‑of‑the‑art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency. Our code and demos will be released at https://wjc2830.github.io/MelQCD/.

Abstract:
Existing automatic audio generation methods struggle to generate podcast‑like audio programs effectively. The key challenges lie in in‑depth content generation and appropriate, expressive voice production. This paper proposes PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic‑discussion content by designing a Host‑Guest‑Writer multi‑agent collaboration system, 2) builds a voice pool for suitable voice‑role matching, and 3) utilizes an LLM‑enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast‑like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness, significantly surpassing direct GPT‑4 generation in topic‑discussion dialogue content, achieving an 87.4% voice‑matching accuracy, and producing more expressive speech through LLM‑guided synthesis. Demo page: https://podcast‑agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.

Abstract:
We introduce InspireMusic, a framework that integrates super‑resolution and a large language model for high‑fidelity long‑form music generation. This unified framework generates high‑fidelity music, songs, and audio by combining an autoregressive transformer with a super‑resolution flow‑matching model. This framework enables the controllable generation of high‑fidelity long‑form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches, as we utilize an audio tokenizer with one codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high‑quality audio generation with long‑form coherence of up to 8 minutes. Then, an autoregressive transformer model based on Qwen 2.5 predicts audio tokens. Next, we employ a super‑resolution flow‑matching model to generate high‑sampling rate audio with fine‑grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic‑1.5B‑Long model has a comparable performance to recent top‑tier open‑source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre‑trained models are released at https://github.com/FunAudioLLM/InspireMusic.

Abstract:
We introduce AV‑Flow, an audio‑visual generative model that animates photo‑realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human‑like speech synthesis, synchronized lip motion, lively facial expressions and head pose; all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus, synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In case of dyadic conversations, AV‑Flow produces an always‑on avatar, that actively listens and reacts to the audio‑visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural‑looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV‑Flow/

Abstract:
Text‑to‑audio (TTA) systems have recently demonstrated strong performance in synthesizing monaural audio from text. However, the task of generating binaural spatial audio from text, which provides a more immersive auditory experience by incorporating the sense of spatiality, has not been explored yet. In this work, we introduce text‑guided binaural audio generation. As an early effort, we focus on the scenario where a monaural reference audio is given additionally. The core problem is to associate specific sound events with their directions, thereby creating binaural spatial audio. The challenge lies in the complexity of textual descriptions and the limited availability of single‑source sound event datasets. To address this, we propose AudioSpa, an end‑to‑end model that applies large language models to process both acoustic and textual information. We employ fusion multi‑head attention (FMHA) to integrate text tokens, which enhances the generation capability of the multimodal learning. Additionally, we propose a binaural source localization model to assess the quality of the generated audio. Finally, we design a data augmentation strategy to generate diverse datasets, which enables the model to spatialize sound events across various spatial positions. Experimental results demonstrate that our model is able to place sounds at the specified locations accurately. It achieves competitive performance in both localization accuracy and signal distortion. Our demonstrations are available at https://linfeng‑feng.github.io/AudioSpa‑demo.

Abstract:
Recent advancements in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text‑to‑speech, text‑to‑audio, and text‑to‑music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that utilizes a decoder‑only transformer to generate desired audio tokens from MIDI tokens and CLAP (Contrastive Language‑Audio Pretraining) embedding, which has timbre‑related information. Our model is capable of performing instrument cloning, text‑to‑instrument synthesis, and text‑guided timbre manipulation without any fine‑tuning. This flexibility enables diverse sound design and intuitive timbre control. We evaluated the quality of the synthesized audio, the timbral similarity between synthesized and target audio/text, and synthesis accuracy (i.e., how accurately it follows the input MIDI) using objective measures. TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: https://github.com/KyungsuKim42/tokensynth

Abstract:
Understanding and explaining differences between audio recordings is crucial for fields like audio forensics, quality assessment, and audio generation. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners. This paper stands out as the first work to comprehensively study the task of explaining audio differences and to propose a benchmark and baselines for the task. First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. Using Large Language Models (LLMs), we generate three levels of difference explanations: (1) concise descriptions of audio events and objects, (2) brief sentences about audio events, acoustic scenes, and signal properties, and (3) comprehensive explanations that include semantics and listener emotions. For the baseline, we use prefix tuning where audio embeddings from two audio files are used to prompt a frozen language model. Our empirical analysis and ablation studies reveal that the naive baseline struggles to distinguish perceptually similar sounds and generate detailed tier 3 explanations. To address these limitations, we propose ADIFF, which introduces a cross‑projection module, position captioning, and a three‑step training process to enhance the model's ability to produce detailed explanations. We evaluate our model using objective metrics and human evaluation and show that our model enhancements lead to significant improvements in performance over the naive baseline and the SoTA Audio‑Language Model (ALM) Qwen Audio. Lastly, we conduct multiple ablation studies to study the effects of cross‑projection, language model parameters, position captioning, and third‑stage fine‑tuning, and present our findings. Our benchmarks, findings, and strong baseline pave the way for nuanced and human‑like explanations of audio differences.

Abstract:
Many video‑to‑audio (VTA) methods have been proposed for dubbing silent AI‑generated videos. An efficient quality assessment method for AI‑generated audio‑visual content (AGAV) is crucial for ensuring audio‑visual quality. Existing audio‑visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA‑3k, the first large‑scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA‑3k includes two subsets: AGAVQA‑MOS, which provides multi‑dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA‑Pair, designed for optimal AGAV pair selection. We further propose AGAV‑Rater, an LMM‑based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and select the best AGAV generated by VTA methods to present to the user. AGAV‑Rater achieves state‑of‑the‑art performance on AGAVQA‑3k, Text‑to‑Audio, and Text‑to‑Music datasets. Subjective tests also confirm that AGAV‑Rater enhances VTA performance and user experience. The dataset and code are available at https://github.com/charlotte9524/AGAV‑Rater.

Abstract:
Recent years have seen significant progress in Text‑To‑Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large‑scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF‑Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text‑To‑Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions ‑‑ a task that is more challenging than current benchmarks.

Abstract:
Despite significant advancements in Text‑to‑Audio (TTA) generation models achieving high‑fidelity audio with fine‑grained context understanding, they struggle to model the relations between audio events described in the input text. Moreover, previous TTA methods have not systematically explored audio event relation modeling, nor have they proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real‑world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models' ability to model audio event relations. Code is available at: https://github.com/yuhanghe01/RiTTA

Abstract:
We propose to synthesize high‑quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single‑modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger‑scale, readily available text‑audio data to learn to generate semantically aligned high‑quality audio samples. Additionally, we improve audio‑visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video‑to‑audio state‑of‑the‑art among public models in terms of audio quality, semantic alignment, and audio‑visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text‑to‑audio generation, showing that joint training does not hinder single‑modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio

Abstract:
Traditional sound design workflows rely on manual alignment of audio events to visual cues, as in Foley sound design, where everyday actions like footsteps or object interactions are recreated to match the on‑screen motion. This process is time‑consuming, difficult to scale, and lacks automation tools that preserve creative intent. Despite recent advances in vision‑to‑audio generation, producing temporally coherent and semantically controllable sound effects from video remains a major challenge. To address these limitations, we introduce FolAI, a two‑stage generative framework that decouples the when and the what of sound synthesis, i.e., the temporal structure extraction and the semantically guided generation, respectively. In the first stage, we estimate a smooth control signal from the video that captures the motion intensity and rhythmic structure over time, serving as a temporal scaffold for the audio. In the second stage, a diffusion‑based generative model produces sound effects conditioned both on this temporal envelope and on high‑level semantic embeddings, provided by the user, that define the desired auditory content (e.g., material or action type). This modular design enables precise control over both timing and timbre, streamlining repetitive tasks while preserving creative flexibility in professional Foley workflows. Results on diverse visual contexts, such as footstep generation and action‑specific sonorization, demonstrate that our model reliably produces audio that is temporally aligned with visual motion, semantically consistent with user intent, and perceptually realistic. These findings highlight the potential of FolAI as a controllable and modular solution for scalable, high‑quality Foley sound synthesis in professional and interactive settings. Supplementary materials are accessible on our dedicated demo page at https://ispamm.github.io/FolAI.

Abstract:
The field of text‑to‑audio generation has seen significant advancements, and yet the ability to finely control the acoustic characteristics of generated audio remains under‑explored. In this paper, we introduce a novel yet simple approach to generate sound effects with control over key acoustic parameters such as loudness, pitch, reverb, fade, brightness, noise and duration, enabling creative applications in sound design and content creation. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques, incorporating learned representations that capture the subtleties of how sound characteristics can be shaped in context, enabling a richer and more nuanced control over the generated audio. Our approach is model‑agnostic and is based on learning the disentanglement between audio semantics and its acoustic features. Our approach not only enhances the versatility and expressiveness of text‑to‑audio generation but also opens new avenues for creative audio production and sound design. Our objective and subjective evaluation results demonstrate the effectiveness of our approach in producing high‑quality, customizable audio outputs that align closely with user specifications.

Abstract:
Generating sound effects for product‑level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high‑quality sounds in few‑shot settings. To tackle the challenge of limited labeled data in real‑world scenes, we introduce YingSound, a foundation model designed for video‑guided sound generation that supports high‑quality audio generation in few‑shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio‑visual aggregator (AVA) that integrates high‑resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi‑modal visual‑audio chain‑of‑thought (CoT) approach to generate finer sound effects in few‑shot settings. Finally, an industry‑standard video‑to‑audio (V2A) dataset that encompasses various real‑world scenarios is presented. We show that YingSound effectively generates high‑quality synchronized sounds across diverse conditional inputs through automated evaluations and human studies. Project Page: https://giantailab.github.io/yingsound/

Abstract:
We introduce OmniFlow, a novel generative model designed for any‑to‑any generation tasks such as text‑to‑image, text‑to‑audio, and audio‑to‑image synthesis. OmniFlow advances the rectified flow (RF) framework used in text‑to‑image models to handle the joint distribution of multiple modalities. It outperforms previous any‑to‑any models on a wide range of tasks, such as text‑to‑image and text‑to‑audio synthesis. Our work offers three key contributions: First, we extend RF to a multi‑modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text‑to‑image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text‑to‑image MMDiT for fine‑tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large‑scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.

Abstract:
Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real‑life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video‑guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low‑quality audio and professional SFX recordings, enabling high‑quality, full‑bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high‑quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/

Abstract:
Audio synthesis has broad applications in multimedia. Recent advancements have made it possible to generate relevant audios from inputs describing an audio scene, such as images or texts. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods solely rely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source‑Aware Audio (SS2A) generator. SS2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross‑modality translation. It then contrastively learns a Cross‑Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate a novel single‑sound‑source visual‑audio dataset VGGS3 from VGGSound. We also design a Sound Source Matching Score to clearly measure localized audio relevance. With the effectiveness of explicit sound source modeling, SS2A achieves state‑of‑the‑art performance in extensive image‑to‑audio tasks. We also qualitatively demonstrate SS2A's ability to achieve intuitive synthesis control by compositing vision, text, and audio conditions. Furthermore, we show that our sound source modeling can achieve competitive video‑to‑audio performance with a straightforward temporal aggregation mechanism.

Abstract:
Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource‑intensive attention and feed‑forward modules. To address this, we introduce SmoothCache, a model‑agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer‑wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT‑XL for image generation, Open‑Sora for text‑to‑video, and Stable Audio Open for text‑to‑audio, highlighting its potential to enable real‑time applications and broaden the accessibility of powerful DiT models.
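
The caching idea described above can be sketched as follows. This is a simplified illustration and not SmoothCache's actual algorithm: it assumes a per-step calibration error for one layer (the difference between its output at step t and at step t-1), a hypothetical `threshold`, and a hypothetical `max_reuse` cap on consecutive cache hits.

```python
def build_cache_schedule(step_errors, threshold, max_reuse=2):
    """Decide, per diffusion step, whether a layer is recomputed (True) or reused (False).

    step_errors[t] is an assumed calibration error between the layer's output
    at step t and at step t-1; step_errors[0] is ignored (step 0 always computes).
    """
    schedule, reused = [], 0
    for t, err in enumerate(step_errors):
        if t == 0 or err >= threshold or reused >= max_reuse:
            schedule.append(True)   # outputs differ too much, or cache is stale
            reused = 0
        else:
            schedule.append(False)  # adjacent steps are similar: reuse the cache
            reused += 1
    return schedule

def run_layer(layer_fn, inputs, schedule):
    """Apply layer_fn along the sampling trajectory, skipping cached steps."""
    cache, outputs, calls = None, [], 0
    for x, recompute in zip(inputs, schedule):
        if recompute or cache is None:
            cache = layer_fn(x)
            calls += 1
        outputs.append(cache)
    return outputs, calls
```

In this toy setup the speedup comes from the fraction of steps marked `False`; the threshold trades the number of skipped layer evaluations against the approximation error introduced by reusing stale outputs.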

Abstract:
The content of visual and audio scenes is multi‑faceted such that a video can be paired with various audio and vice‑versa. Thereby, in the video‑to‑audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While Video‑to‑Audio generation is a well‑established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi‑modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description of the audio. Such a framework has two advantages: i) the Video‑to‑Audio generation process can be refined and controlled via text, which complements the context of visual information, and ii) the model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, an LLM that is fine‑tuned for instructions and includes a projection layer that maps video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and from an optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by a pretrained neural codec. Experiments show that when VATT is compared to existing video‑to‑audio generation methods in objective metrics, it achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that audio generated by VATT Audio is preferred over audio generated by existing methods. VATT enables controllable video‑to‑audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text‑guided video‑to‑audio generation and video‑to‑audio captioning.

Abstract:
Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi‑Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations. The code and model weights are open‑sourced at https://github.com/hubertsiuzdak/snac.
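
As a rough illustration of residual quantization at several temporal resolutions, the sketch below quantizes a 1‑D sequence with one scalar codebook per level. It is a simplification under stated assumptions: real codecs use learned vector codebooks and learned down/upsampling, whereas here pooling is a plain average, upsampling is repetition, and the function name `multiscale_rvq` and all values are hypothetical.

```python
from typing import List, Sequence, Tuple


def nearest(codebook: Sequence[float], x: float) -> int:
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def multiscale_rvq(
    frames: Sequence[float],
    codebooks: Sequence[Sequence[float]],
    strides: Sequence[int],
) -> Tuple[List[List[int]], List[float]]:
    """Quantize the running residual level by level, each at its own stride.

    A level with stride k emits one code per k input frames, so coarse levels
    contribute fewer tokens than fine ones.
    """
    if any(len(frames) % s for s in strides):
        raise ValueError("sequence length must be divisible by every stride")
    residual = list(frames)
    recon = [0.0] * len(frames)
    all_codes: List[List[int]] = []
    for codebook, stride in zip(codebooks, strides):
        # Downsample the residual by average pooling (assumption of this sketch).
        pooled = [
            sum(residual[i:i + stride]) / stride
            for i in range(0, len(residual), stride)
        ]
        codes = [nearest(codebook, p) for p in pooled]
        all_codes.append(codes)
        # Upsample by repetition and remove the quantized part from the residual.
        for j, c in enumerate(codes):
            for i in range(j * stride, (j + 1) * stride):
                recon[i] += codebook[c]
                residual[i] -= codebook[c]
    return all_codes, recon
```

With strides `[4, 1]` on eight frames, the coarse level emits 2 codes and the fine level 8, compared with 16 codes for two full‑rate levels.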

Abstract:
Recent advancements in latent diffusion models (LDMs) have markedly enhanced text‑to‑audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment. While recent methods utilizing consistency‑based distillation aim to achieve few‑step or single‑step inference, their one‑step performance is constrained by curved trajectories, preventing them from surpassing traditional diffusion models. In this work, we introduce FlashAudio with rectified flows to learn straight flow for fast simulation. To alleviate the inefficient timestep allocation and suboptimal distribution of noise, FlashAudio optimizes the time distribution of rectified flow with Bifocal Samplers and proposes immiscible flow to minimize the total distance of data‑noise pairs in a batch via assignment. Furthermore, to address the amplified accumulation error caused by the classifier‑free guidance (CFG), we propose Anchored Optimization, which refines the guidance scale by anchoring it to a reference trajectory. Experimental results on text‑to‑audio generation demonstrate that FlashAudio's one‑step generation performance surpasses the diffusion‑based models with hundreds of sampling steps on audio quality and enables sampling 400x faster than real‑time on a single NVIDIA 4090Ti GPU. Code will be available at https://github.com/liuhuadai/FlashAudio.
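
The batch‑level assignment mentioned in this abstract can be sketched as follows. This is a toy version under assumptions: the Euclidean distance, the brute‑force search, and the function names are illustrative choices, not the paper's procedure. For realistic batch sizes a linear assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation search.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def total_distance(data: Sequence[Point], noise: Sequence[Point], perm: Sequence[int]) -> float:
    """Sum of distances when data[i] is paired with noise[perm[i]]."""
    return sum(math.dist(data[i], noise[j]) for i, j in enumerate(perm))


def assign_noise(data: Sequence[Point], noise: Sequence[Point]) -> List[int]:
    """Return the pairing that minimizes the total data-noise distance.

    Brute force over all permutations, so only usable for very small batches.
    """
    best = min(
        itertools.permutations(range(len(noise))),
        key=lambda perm: total_distance(data, noise, perm),
    )
    return list(best)
```

Pairing each data point with a nearby noise sample shortens the paths the flow has to learn, which is the stated motivation for minimizing the total distance within a batch.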

Abstract:
We present Synthio, a novel approach for augmenting small‑scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real‑world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text‑to‑audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small‑scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small‑scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small‑scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited‑data settings. Results indicate our method consistently outperforms all baselines by 0.1%‑39% using a T2A model trained only on weakly‑captioned AudioSet.

Abstract:
This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open‑vocabulary natural language prompts (e.g., "make this sound in‑your‑face and bold"). Text2FX operates without retraining any models, relying instead on single‑instance optimization within the existing embedding space, thus enabling a flexible, scalable approach to open‑vocabulary sound transformations through interpretable and disentangled FX manipulation. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text‑audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception. Demos and code are available at anniejchu.github.io/text2fx.

Abstract:
Diffusion‑based text‑to‑audio (TTA) generation has made substantial progress, leveraging latent diffusion models (LDMs) to produce high‑quality, diverse and instruction‑relevant audio. However, beyond generation, the task of audio editing remains equally important but has received comparatively little attention. Audio editing tasks face two primary challenges: executing precise edits and preserving the unedited sections. While workflows based on LDMs have effectively addressed these challenges in the field of image processing, similar approaches have been scarcely applied to audio editing. In this paper, we introduce AudioEditor, a training‑free audio editing framework built on a pretrained diffusion‑based TTA model. AudioEditor incorporates Null‑text Inversion and EOT‑suppression methods, enabling the model to preserve original audio features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high‑quality audio edits. Code and demo can be found at https://github.com/NKU‑HLT/AudioEditor.

Abstract:
Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high‑quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source‑Disentangled Neural Audio Codec (SD‑Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD‑Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD‑Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.

Abstract:
We introduce EzAudio, a text‑to‑audio (T2A) generation framework designed to produce high‑quality, natural‑sounding sound effects. Core designs include: (1) We propose EzAudio‑DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier‑free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open‑source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre‑trained models are released at: https://haidog‑yaqub.github.io/EzAudio‑Page/.
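
Point (2) can be illustrated with the commonly used form of CFG rescaling, in which the guided prediction is scaled so that its standard deviation matches that of the conditional prediction and is then blended with the unscaled result. This is a sketch under assumptions: the abstract does not spell out EzAudio's exact variant, and the blend factor `phi` and the function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def cfg(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance on flat lists of predictions."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def rescaled_cfg(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float,
    phi: float = 0.7,
) -> List[float]:
    """Guided prediction whose spread is pulled back toward the conditional one.

    phi = 0 returns plain CFG; phi = 1 fully matches the conditional std.
    """
    guided = cfg(cond, uncond, scale)
    std_cond = statistics.pstdev(cond)
    std_guided = statistics.pstdev(guided)
    if std_guided == 0.0:
        return guided  # nothing to rescale
    rescaled = [g * std_cond / std_guided for g in guided]
    return [phi * r + (1.0 - phi) * g for r, g in zip(rescaled, guided)]
```

High guidance scales inflate the magnitude of the prediction; matching the standard deviation counteracts that inflation while keeping the guided direction.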

Abstract:
Visual and auditory perception are two crucial ways humans experience the world. Text‑to‑video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video‑to‑Audio (STA‑V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross‑modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text‑to‑Audio priors initialization and cross‑modal guidance. We also introduce Audio‑Audio Align, a new metric to assess audio‑temporal alignment. Subjective and objective metrics demonstrate that our method surpasses existing Video‑to‑Audio models in generating audio with better quality, semantic consistency, and temporal alignment. The ablation experiment validated the effectiveness of each module. Audio samples are available at https://y‑ren16.github.io/STAV2A.

Abstract:
Discrete tokens obtained from neural audio codecs (NACs) have been used as compact representations in audio generation and understanding models. In such token‑based systems, token temporal resolution (TTR), defined as the time interval between adjacent token frames, is important because it controls the trade‑off between representing rapid acoustic events and reducing token‑sequence length. However, most NACs are trained at a single TTR and require separate training for each TTR. This paper proposes a mechanism that enables a single NAC to operate at multiple TTRs using sampling‑frequency‑independent convolutional layers. The mechanism regards TTR as the sampling period of the token sequence and generates TTR‑dependent convolutional kernels from a shared parameter set, while adjusting the kernel size and stride for each TTR. We incorporate the mechanism into Descript Audio Codec, leaving the quantizer unchanged. Experiments on environmental sound reconstruction show that the proposed model outperforms a single‑model baseline that switches TTR‑specific layers for each TTR.

Abstract:
This work investigates the effect of batch sampling strategies during training for text‑to‑audio music generation under low‑data and small‑scale model settings. This paper describes our approach and findings for the ICME 2026 Grand Challenge on Academic Text‑to‑Music Generation. Training data are clustered using either text embeddings or audio embeddings, and samples with similar characteristics are grouped within the same mini‑batch to mitigate gradient interference. The effects of modality and cluster granularity on clustering are analyzed. Results show that clustering based on text embeddings achieves better performance on objective evaluation metrics than clustering based on audio embeddings. In addition, different cluster granularity leads to different behaviors across evaluation criteria: a moderate number of clusters performs best on objective metrics, while a larger number of clusters tends to exhibit music with more coherent structure in listening tests.
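
The batch construction described here can be sketched in a few lines. The sketch assumes cluster IDs have already been computed (for example by clustering text embeddings); the function name `cluster_batches`, the seed handling, and the treatment of leftover samples as smaller batches are illustrative assumptions rather than the authors' setup.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def cluster_batches(cluster_ids: Sequence[int], batch_size: int, seed: int = 0) -> List[List[int]]:
    """Group sample indices so that every mini-batch comes from a single cluster."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    batches: List[List[int]] = []
    for members in by_cluster.values():
        rng.shuffle(members)  # randomize the order within a cluster
        for i in range(0, len(members), batch_size):
            batches.append(members[i:i + batch_size])  # last batch may be smaller
    rng.shuffle(batches)  # interleave clusters across training steps
    return batches
```

The number of clusters is the granularity knob discussed in the abstract: fewer clusters give more varied batches, more clusters give more homogeneous ones.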

Abstract:
Diffusion‑based text‑to‑audio (TTA) models achieve impressive synthesis quality but suffer from high inference latency due to iterative multi‑step denoising. Existing one‑step approaches alleviate this issue but still rely on paired text‑audio data during distillation. To address these limitations, we propose SwiftAudio, a one‑step TTA framework that performs audio‑free distillation from a pretrained diffusion teacher using only text captions. Specifically, we adapt Variational Score Distillation (VSD) to the audio domain and introduce a temporal smoothness regularization objective to encourage coherent latent audio representations. This design enables the student model to inherit the teacher's generative prior without requiring paired audio supervision and allows effective training with only approximately 45K captions. Experiments on AudioCaps and Clotho demonstrate that SwiftAudio achieves state‑of‑the‑art performance among strict one‑step methods and substantially narrows the gap to multi‑step diffusion systems. Project page: https://swiftaudio.org/

Abstract:
Automatic Speech Recognition (ASR) systems, despite achieving remarkable accuracy in general‑purpose domains with native speech (L1), struggle in domains like Air Traffic Control (ATC) due to strong channel noise, the presence of non‑native (L2) English accents, and data scarcity. We propose a synthetic data generation pipeline with acoustic property simulation, specifically designed to address this lack of real data and improve recognition accuracy in the ATC domain. Our approach leverages a combination of neural generation techniques, including Text‑to‑Speech, Voice Conversion, L2‑to‑L1 accent conversion, and a novel controllable L1‑to‑L2 accent conversion framework built to simulate accented speech. Our experiments with the Whisper model on the ATCO2 corpus demonstrate that fine‑tuning with either synthetic data alone, or a mix of real and synthetic data, significantly improves the word error rate over the out‑of‑the‑box and real‑data‑only baselines, respectively.

Abstract:
Text‑to‑audio (TTA) generation, synthesizing audio from natural language, has been widely studied for its ability to capture precise user intent. To effectively advance TTA models, it is essential to reliably evaluate generated audio without relying on costly human subjective ratings, motivating the development of automatic evaluation metrics that correlate well with human judgments. While recent CLAP‑based metrics provide practical reference‑free solutions, their coarse‑grained text‑audio similarity matching often correlates poorly with human ratings. To address this, we propose ELSA, a reference‑free evaluation metric for fine‑grained text‑audio alignment. ELSA decomposes generated audio guided by distinct acoustic events derived from the text query and assesses event‑level alignment. Experiments across four TTA benchmarks show that ELSA reveals a higher correlation with human subjective ratings than prior metrics, highlighting its effectiveness for reliable TTA evaluation.

Abstract:
Text‑to‑audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge: existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training‑free framework leveraging the state‑of‑the‑art Rectified Flow‑based TangoFlux model. FreeSonic utilizes an optimized inversion‑reverse process and joint text‑audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task‑oriented noise injection enhances versatility for tasks such as audio removal and non‑rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high‑fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free‑sonic.github.io/

Abstract:
We introduce AudEdit, an inversion‑free method for text‑guided editing of real audio with a pretrained rectified‑flow audio generator. Text‑to‑audio systems such as Stable Audio 3 already expose audio‑to‑audio editing by noising an input recording and denoising it under a new prompt, but this inversion‑style route must trade prompt adherence against preservation of rhythm, transients, timbre, and long‑range musical structure. Motivated by recent inversion‑free flow editing in computer vision, we develop an audio‑specific direct source‑to‑target ordinary differential equation for one‑dimensional Stable Audio 3 latents: at each flow step, we compare the target‑ and source‑conditioned velocity fields under a shared stochastic source marginal, and update the edited latent by their difference. The resulting editor requires no training, no paired edit data, no optimization, and no access to internal attention maps. Across sound‑effect and music editing sets built from FSD50K and the Song Describer Dataset, AudEdit improves CLAP text alignment and audio preservation over SDEdit, ODE inversion, and FireFlow; for example, on sound effects it raises target‑text CLAP similarity from 0.42 to 0.52 over the strongest baseline while reducing FAD from 65.70 to 50.37.
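
The velocity‑difference update described in this abstract can be sketched on toy data. The sketch makes several assumptions that are not in the abstract: the time convention (t = 1 is noise, t = 0 is data), the linear interpolation path, the Euler step count, and in particular `toy_velocity`, a hypothetical stand‑in for a pretrained rectified‑flow model in which each condition is simply a scalar mean. It is meant to show the shape of the update, not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def toy_velocity(z: Sequence[float], t: float, mean: float) -> List[float]:
    """Exact rectified-flow velocity when the data for a condition is a point at `mean`."""
    return [(zi - mean) / t for zi in z]


def edit(
    x_src: Sequence[float],
    velocity: Callable[[Sequence[float], float, float], List[float]],
    c_src: float,
    c_tgt: float,
    n_steps: int = 20,
    seed: int = 0,
) -> List[float]:
    """Move the source latent toward the target condition without inverting it."""
    rng = random.Random(seed)
    z_edit = list(x_src)
    ts = [1.0 - i / n_steps for i in range(n_steps + 1)]  # 1.0 down to 0.0
    for t, t_next in zip(ts[:-1], ts[1:]):
        # Noisy source sample drawn from the forward marginal at time t.
        noise = [rng.gauss(0.0, 1.0) for _ in x_src]
        z_src = [(1.0 - t) * x + t * n for x, n in zip(x_src, noise)]
        # Matching target sample: same noise, shifted by the edit so far.
        z_tgt = [ze + zs - x for ze, zs, x in zip(z_edit, z_src, x_src)]
        v_tgt = velocity(z_tgt, t, c_tgt)
        v_src = velocity(z_src, t, c_src)
        # Euler step on the difference of the two velocity fields.
        z_edit = [
            ze + (t_next - t) * (vt - vs)
            for ze, vt, vs in zip(z_edit, v_tgt, v_src)
        ]
    return z_edit
```

In this toy setting the edited latent ends up as the source shifted by the difference between the target and source means, so structure in the source that the conditions do not describe is carried over unchanged.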

Abstract:
Audio language models (ALMs) are increasingly used for speech‑based understanding, yet their ability to perform semantic reasoning beyond transcription, Text‑to‑Audio Retrieval, Captioning, and Question‑Answering accuracy remains insufficiently benchmarked. In particular, the effects of accent variation, domain shift, and semantic over‑inference on audio reasoning are poorly understood. We evaluate audio language models across five semantic and paralinguistic reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Collectively, these tasks assess a model's ability to reason over spoken audio as the primary evidence source, including whether a textual hypothesis can be inferred, contradicted, or left undetermined by the audio, whether statements align or conflict with spoken content, whether claims are plausible given the discourse, and whether model predictions remain stable or appropriately constrained across accent variation. These findings highlight critical limitations in current audio reasoning evaluations, and we hope they provide guidance for more robust and equitable ALM design and assessment.

Abstract:
Handling toxic retrieval in text‑to‑audio systems is challenging due to contextual dependencies. Existing strategies (e.g., rephrasing, summarization) risk altering intent or omitting details. We propose a post hoc causal debiasing framework with a sentiment‑controlled mediator to preserve semantic relevance while suppressing harmful speech. Our approach is model‑agnostic and integrates seamlessly with existing retrieval pipelines. We introduce two variants: Forgive, which re‑ranks and filters toxic audio via logit adjustment, and Forget, which generates counterfactual toxic prompts to mitigate harmful retrievals. Experiments show consistent toxicity reduction with minimal loss in retrieval accuracy, improving both safety and reliability.

Abstract:
Text‑to‑audio retrieval has made significant progress with shared embedding models such as CLAP and Pengi, yet they often struggle with fine‑grained semantic alignment due to the inherent modality gap between text and audio. In this work, we propose FORTE, a unified framework that integrates structured logical reasoning with parameter‑efficient cross‑modal alignment to improve retrieval precision. Our approach first transforms queries into first‑order logic and refines them via a constrained search that preserves semantic invariance while introducing discriminative attributes. The refined representation is then aligned with audio embeddings using a lightweight projection module, followed by a predicate‑aware re‑ranking step that enforces logical consistency at inference. Extensive experiments on AudioCaps and Clotho demonstrate consistent improvements over strong baselines, particularly in challenging fine‑grained scenarios. Our results highlight the effectiveness of combining symbolic reasoning with representation learning for cross‑modal retrieval.

Abstract:
Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task‑level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley‑Omni, a unified multimodal audio generation model that extends isolated task‑level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST‑Bench, a benchmark for holistic video soundtrack generation evaluation. Experiments show that Foley‑Omni achieves competitive performance with expert systems on individual synthesis tasks, while improving speech intelligibility, audiovisual consistency and perceptual quality for mixed soundtrack generation.

Abstract:
Recent song generation systems can synthesize realistic audio, yet generating complete songs remains challenging for two reasons. First, explicit song‑level arrangement planning remains limited in existing methods, so models often need to organize overall arrangement development while generating low‑level audio details. This often leads to incoherence in arrangements, such as weak section transitions and limited dynamic progression. Second, coarse modeling of different musical parts obscures their distinct roles and interactions, limiting arrangement richness of generated songs. In this paper, we present SketchSong, a hierarchical song generation framework that addresses these issues through song‑level sketch planning and fine‑grained multi‑track modeling. Along the temporal dimension, SketchSong first predicts a compact sequence of high‑level sketch tokens derived from compressed audio representations, and then generates audio tokens conditioned on these sketches. This coarse‑to‑fine process gives the model an explicit arrangement plan before detailed audio generation. Along the track dimension, SketchSong explicitly models four tracks, i.e., vocals, bass, drums and other instruments. This enables the model to capture the roles and interactions of different musical parts more precisely. Experiments on song generation benchmarks show that SketchSong consistently outperforms our baseline on both objective metrics and human listening tests. Despite not employing additional post‑training for preference optimization such as lyrics and text‑prompt alignments, SketchSong achieves competitive results against strong, post‑trained open‑source systems, demonstrating the effectiveness of our overall design.

Abstract:
The rapid advancement of instruction‑guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general‑purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine‑grained attribute mismatches. To address this, we introduce a novel dynamic rubric‑based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio‑Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large‑scale corpus of 105K samples with explicit Chain‑of‑Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio‑Judge model. By employing a training pipeline that combines Supervised Fine‑Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric‑based scoring mechanism. Extensive experiments demonstrate that AnyAudio‑Judge not only significantly enhances zero‑shot alignment detection compared to state‑of‑the‑art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.

Abstract:
We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text‑to‑audio, text‑to‑speech, zero‑shot speaker cloning, mixed speech‑and‑sound generation, scene‑level audio editing, speech‑in‑scene editing, and timed temporal composition, all of which share a single set of weights. Our architecture features two core designs: (1) Layer‑wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM‑DiT blocks via learned projections, providing depth‑matched semantic conditioning that improves instruction following over single‑layer baselines; and (2) a unified multi‑task architecture where task identity is encoded solely by a channel‑wise mask and source audio is provided through VAE‑encoded channel concatenation. Training is stabilized by an online GPU‑side multi‑task data synthesis pipeline with task‑homogeneous batching and a two‑stage curriculum. With 621M‑732M trainable parameters, UNISON achieves results competitive with or exceeding task‑specialist models across evaluated domains, while being roughly 4× smaller than comparable unified systems.

Abstract:
Recent advancements in text‑guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment‑aware text‑to‑speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross‑modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript‑aligned speech latent with text‑conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain‑specific representation alignment objective tailored to environment‑aware TTS, leveraging complementary self‑supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.

Abstract:
Real‑time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a tradeoff between generation quality and high inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high‑fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere mainly makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high‑quality spatial audio generation. 2) We design a Spatial Video‑Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi‑objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video‑to‑spatial and text‑to‑spatial audio generation tasks. Demos can be found at: https://swanaigc.github.io.

Abstract:
Generative video‑to‑audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single‑video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state‑of‑the‑art models reveals a consistent trade‑off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics‑based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos‑lab/flatsounds/

Abstract:
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine‑grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free‑form text prompts. In this paper, we introduce a new task: Free‑Form‑Text‑Prompt‑to‑Unified‑Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM‑based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain‑of‑thought mechanism, an implicit planning mechanism that bridges high‑level semantic understanding and low‑level acoustic synthesis. Furthermore, we create PlanAudio‑Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi‑scenario training curricula.

Abstract:
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor‑specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real‑world audio, we introduce a novel training strategy that uses conditional audio generation and source‑separated stems to strongly encourage single‑factor variation in training data. Our evaluations demonstrate strong factor‑wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real‑world audio.

Abstract:
Audio‑visual generation is rapidly advancing from short clips to minute‑long content, while existing evaluation protocols remain largely confined to short‑form settings. Existing benchmarks primarily focus on 5‑10 second text‑conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio‑visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV‑Compass, a systematic benchmark for minute‑long audio‑visual generation. LongAV‑Compass contains 284 curated test cases spanning text‑to‑audio‑video (T2AV), image‑to‑audio‑video (I2AV), and video‑to‑audio‑video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy‑guided benchmark construction with a unified evaluation framework that integrates MLLM‑assisted assessment with complementary perceptual and multimodal metrics, including DINO‑v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine‑grained dimensions covering within‑segment quality, cross‑segment consistency, global narrative coherence, semantic alignment, and audio‑visual synchronization. Through experiments on 11 representative models together with human‑alignment validation, LongAV‑Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute‑scale audio‑visual generation across diverse input modalities.

Abstract:
Latent diffusion models have emerged as the dominant paradigm for many generation tasks, including audio generation tasks such as text‑to‑audio, text‑to‑music, and text‑to‑speech. A key component of latent diffusion is a variational autoencoder (VAE) that compresses high‑dimensional signals into a low‑frame‑rate continuous representation that is conducive to downstream prediction. Regularizing these VAEs is challenging, as there is a trade‑off between over‑regularized (poor output quality) and under‑regularized (difficult to predict) latent representations. We propose a framework for studying this trade‑off through compression and train audio VAEs at specific bitrates via target‑KL regularization. This allows direct comparison to well‑studied discrete neural audio codec models, and the construction of rate‑distortion curves for audio VAEs. We evaluate the impact of target‑KL regularization on text‑to‑sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.
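
The bitrate framing can be made concrete with a small calculation. In a VAE the KL term is commonly read as a rate, so a per-frame KL budget in nats converts to bits per second once the latent frame rate is fixed. The sketch below shows that conversion and a simple penalty that pulls a measured KL toward a target. The function names, the absolute-value penalty, and the example numbers are assumptions for illustration, not the paper's formulation.

```python
import math


def kl_to_bitrate(kl_nats_per_frame: float, frame_rate_hz: float) -> float:
    """Convert a per-frame KL (in nats) to a nominal bitrate in bits per second."""
    bits_per_frame = kl_nats_per_frame / math.log(2)  # nats -> bits
    return bits_per_frame * frame_rate_hz


def bitrate_to_target_kl(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Inverse mapping: the per-frame KL target (in nats) for a desired bitrate."""
    return bitrate_bps / frame_rate_hz * math.log(2)


def target_kl_penalty(kl_nats_per_frame: float, target_kl: float, weight: float = 1.0) -> float:
    """Hypothetical regularizer: penalize deviation of the measured KL from the target."""
    return weight * abs(kl_nats_per_frame - target_kl)
```

Under these assumptions, a 25 Hz latent stream held to 2 kbps carries 80 bits per frame, which `bitrate_to_target_kl(2000.0, 25.0)` expresses as roughly 55 nats.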

Abstract:
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time‑aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer‑based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre‑trained codec decoder. We experiment with multiple state‑of‑the‑art neural codecs, namely EnCodec, DAC, and X‑Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E‑GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec‑token prediction as an effective route for drum grid‑to‑audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.

Abstract:
Recent progress in diffusion‑based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text‑conditioned audio generation and audio‑conditioned super‑resolution. However, training audio diffusion models remains computationally expensive, and most existing pipelines still rely on static optimization recipes that treat the relative importance of training signals as fixed throughout learning. In this work, we argue that a major source of inefficiency lies in the evolving balance between semantic acquisition and generation‑oriented refinement. Early training places stronger emphasis on acquiring condition‑aligned semantic structure and coarse global organization, whereas later training increasingly emphasizes temporal consistency, perceptual fidelity, and fine‑detail refinement. To characterize this evolving balance, we introduce a progress‑based regime variable derived from the training‑time slope of an SSL‑space discrepancy, which measures semantic progress during training. Based on this signal, we develop three complementary stage‑aware mechanisms: decayed SSL guidance for early semantic bootstrapping, self‑adaptive timestep sampling driven by the regime variable, and structure‑aware regularization activated from convergent grouped organization in parameter space. We evaluate these mechanisms on text‑conditioned audio generation and audio‑conditioned super‑resolution. Across both settings, the proposed stage‑aware strategies improve convergence behavior and yield gains on the primary generation and spectral reconstruction metrics over standard static baselines. These results support the view that efficient audio diffusion training can benefit from treating external guidance, internal organization, and optimization emphasis as stage‑dependent components rather than fixed ingredients.

Abstract:
Recent advances in multimodal generation have enabled high‑quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach involves applying a standard sound event detection model to the generated audio. However, this post‑hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio‑LABEL (LAtent‑Based Event Labeling), an event‑aware audio generation framework built on a foundational audio generation model as its backbone that jointly generates audio and frame‑aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17‑class material classification. Our approach improves onset‑detection accuracy from 46.7% to 75.0% and material‑classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables more interpretable and practical video‑to‑audio synthesis.

Abstract:
Autoregressive (AR) models with diffusion heads have recently achieved strong text‑to‑audio performance, yet their iterative decoding and multi‑step sampling process introduce high‑latency issues. To address this bottleneck, we propose a one‑step sampling framework that combines an energy‑distance training objective with representation‑level distillation. An energy‑scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text‑to‑audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one‑step baselines such as ConsistencyTTA, SoundCTM, AudioLCM and AudioTurbo, on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi‑step sampling. Compared to the state‑of‑the‑art AR diffusion system, IMPACT, our approach achieves up to 8.5x faster batch inference with highly competitive audio quality. These results demonstrate that combining energy‑distance training with representation‑level distillation provides an effective recipe for fast, high‑quality text‑to‑audio synthesis.
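
As a rough illustration of an energy-distance objective, the sketch below computes the sample-based energy score for one target latent given several one-step generator outputs: the mean distance to the target minus half the mean pairwise distance among the samples. This is the generic textbook form. The paper's exact loss, latent shapes, and weighting are not reproduced, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def _euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def energy_score(samples: Sequence[Vector], target: Vector) -> float:
    """Sample estimate of E||X - y|| - 0.5 * E||X - X'|| (lower is better)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples for the pairwise term")
    # Attraction term: average distance from each generated sample to the target.
    fit = sum(_euclidean(x, target) for x in samples) / len(samples)
    # Repulsion term: average distance between distinct generated samples.
    pairs = list(combinations(samples, 2))
    spread = sum(_euclidean(a, b) for a, b in pairs) / len(pairs)
    return fit - 0.5 * spread
```

The second term rewards diversity among the generated samples, which is what keeps a one-step generator trained this way from collapsing to a single averaged output.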

Abstract:
Navigational aids for blind and low vision individuals struggle to convey dynamic real‑world environments, leading to cognitive overload from continuous, undifferentiated feedback. We present AMAVA, a novel real‑time video‑to‑audio framework that converts mobile device video into contextually relevant sound effects or text‑to‑speech descriptions. We propose a motion‑aware pipeline that uses a lightweight AI classification model to distinguish between low‑ and high‑movement scenes, followed by a real‑time text‑to‑audio synthesis pipeline, to enhance environmental perception more efficiently. In static environments, AMAVA generates spoken audio scene descriptions for situational awareness. In high‑movement situations, it prioritizes safety by delivering sound cues, such as spoken hazard alerts and environmental sound effects. These audio outputs are produced by a decoder‑only transformer‑based vision‑language model with mixture‑of‑experts and cross‑modal attention for visual understanding, in conjunction with neural text‑to‑speech and natural sound synthesis networks. The proposed framework uses prompt‑based caching and category‑specific throttling to avoid auditory clutter and minimize latency. We present a comprehensive evaluation of the system, including a real‑time navigation study comparing a white cane alone versus with AMAVA, which shows a significant increase in user confidence and perceived safety.

Authors: Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hongxiang Hao, Haoxun He, Jiaao He, Qian He, Tuyen Hoang, Heng Hu, Ruoqing Hu, Yuxiang Hu, Jiancheng Huang, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Jishuo Jin, Ming Jing, Ashley Kim, Shanshan Lao, Yichong Leng, Bingchuan Li, Gen Li, Haifeng Li, Huixia Li, Jiashi Li, Ming Li, Xiaojie Li, Xingxing Li, Yameng Li, Yiying Li, Yu Li, Yueyan Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Wang Liao, J. H. Lien, Shanchuan Lin, Xi Lin, Feng Ling, Yue Ling, Fangfang Liu, Jiawei Liu, Jihao Liu, Jingtuo Liu, Shu Liu, Sichao Liu, Wei Liu, Xue Liu, Zuxi Liu, Ruijie Lu, Lecheng Lyu, Jingting Ma, Tianxiang Ma, Xiaonan Nie, Jingzhe Ning, Junjie Pan, Xitong Pan, Ronggui Peng, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Wenjing Tang, Boyang Tao, Zirui Tao, Dongliang Wang, Feng Wang, Hulin Wang, Ke Wang, Qingyi Wang, Rui Wang, Shuai Wang, Shulei Wang, Weichen Wang, Xuanda Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Zijie Wang, Ziyu Wang, Guoqiang Wei, Meng Wei, Di Wu, Guohong Wu, Hanjie Wu, Huachao Wu, Jian Wu, Jie Wu, Ruolan Wu, Shaojin Wu, Xiaohu Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Xin Xia, Xuefeng Xiao, Shuang Xu, Bangbang Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yihang Yang, Zhixian Yang, Ziyan Yang, Fulong Ye, Bingqian Yi, Xing Yin, Yongbin You, Linxiao Yuan, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Siyu Zhai, Zhonghua Zhai, Bowen Zhang, Chenlin Zhang, Heng Zhang, Jun Zhang, Manlin Zhang, Peiyuan Zhang, Shuo Zhang, Xiaohe Zhang, Xiaoying Zhang, Xinyan Zhang, Xinyi Zhang, Yichi Zhang, Zixiang Zhang, Haiyu Zhao, Huating Zhao, Liming Zhao, Yian Zhao, Guangcong Zheng, Jianbin Zheng, Xiaozheng Zheng, Zerong Zheng, Kuan Zhu, Feilong Zuo

Abstract:
Seedance 2.0 is a new native multi‑modal audio‑video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large‑scale architecture for multi‑modal audio‑video joint generation. This allows it to support four input modalities (text, image, audio, and video) by integrating one of the most comprehensive suites of multi‑modal content reference and editing capabilities available in the industry to date. It delivers substantial, well‑rounded improvements across all key sub‑dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio‑video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi‑modal inputs as reference, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide the Seedance 2.0 Fast version, an accelerated variant of Seedance 2.0 designed to boost generation speed for low‑latency scenarios. Seedance 2.0 has delivered significant improvements to its foundational generation capabilities and multi‑modal generation performance, bringing an enhanced creative experience for end users.

Abstract:
Audio tokenization has emerged as a critical component in end‑to‑end audio language models, enabling efficient discrete representation learning for both audio understanding and generation tasks. However, existing audio tokenizers face fundamental limitations in understanding tasks due to single‑modality constraints, particularly when audio signals contain ambiguous or incomplete information. While incorporating additional modality information can significantly enhance audio understanding, current multimodal fusion approaches invariably degrade reconstruction quality. This degradation is unacceptable for end‑to‑end audio systems that require high‑fidelity audio generation capabilities. In this work, we investigate the root causes of reconstruction quality degradation in video‑enhanced audio tokenization and present three key findings. First, the location of fusion within the tokenizer architecture is crucial for preserving reconstruction quality. Second, we show that contrastive learning, though effective in continuous representation fusion, is unsuitable for discrete tokenizers as it fails to enhance downstream task performance. Third, while feature‑dimension fusion approaches achieve moderate success, we discover that fusing along the temporal axis, guided by the concept of distinctive features, yields significantly better results. Building on these insights, we introduce the Timing‑Aware Pre‑Quantization Fusion for Video‑Enhanced Audio Tokenization, the first approach to successfully integrate visual information into audio tokenizer architectures while preserving reconstruction fidelity. Our approach not only maintains high‑fidelity reconstruction but also achieves superior performance on downstream understanding tasks compared with audio‑only tokenizers and established multimodal fusion baselines.

Abstract:
Video‑to‑Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine‑grained requirements of distinct audio categories. To address this gap, we propose VidAudio‑Bench, a multi‑task benchmark for V2A evaluation with four key features: (1) Broad Coverage: It encompasses four representative audio categories ‑ sound effects, music, speech, and singing ‑ under both V2A and Video‑Text‑to‑Audio (VT2A) settings. (2) Extensive Evaluation: It comprises 1,634 video‑text pairs and benchmarks 11 state‑of‑the‑art generation models. (3) Comprehensive Metrics: It introduces 13 task‑specific, reference‑free metrics to systematically assess audio quality, video‑audio consistency, and text‑audio consistency. (4) Human Alignment: It validates all metrics through subjective studies, demonstrating strong consistency with human preferences. Experimental results reveal that current V2A models perform poorly in speech and singing compared to sound effects. Our VT2A results further highlight a fundamental tension between instruction following and visually grounded generation: stronger visual conditioning improves video‑audio alignment, but often at the cost of generating the intended audio category. These findings establish VidAudio‑Bench as a comprehensive and scalable framework for diagnosing V2A systems and provide new insights into multimodal audio generation.

Abstract:
Recent advances in multimodal audio generation have enabled music synthesis from text, visual cues, and other high‑level conditions. However, most systems are designed for a single operating mode: either generating music without a reference mixture or extracting a target source from an existing mixture. This fixed‑task design limits their use when different combinations of text, visual, and mixture inputs are available. To address this gap, we propose MAGE, a modality‑agnostic framework for conditional music generation and mixture‑grounded target‑source extraction within a shared continuous latent space. Our approach introduces three key components. First, a Controlled Multimodal FluxFormer models the conditional flow from noise to a target audio latent, enabling the same backbone to operate with or without a mixture condition. Second, Audio‑Visual Nexus Alignment maps frame‑level visual features onto the audio latent sequence, allowing visual evidence to condition the generation process at the audio‑token level. Third, a cross‑gated modulation mechanism uses the aligned visual representation to regulate intermediate audio features, while text provides separate semantic guidance. We further train MAGE with dynamic modality masking, exposing the same model to text‑only, visual‑only, joint text‑visual, mixture‑conditioned, and unconditional configurations. Experiments on the MUSIC benchmark evaluate MAGE under separate protocols for mixture‑free generation and mixture‑grounded target‑source extraction. The results show that MAGE provides a shared conditioning interface across both settings, and that the proposed alignment and gating components improve interference suppression in the extraction task.

Abstract:
Audio‑video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion‑sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion‑aware structure shared by video and audio generation. We present Tora3, a trajectory‑guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video‑only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory‑aligned motion representation for video, a kinematic‑audio alignment module driven by trajectory‑derived second‑order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory‑conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large‑scale AV dataset emphasizing motion‑relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion‑sound synchronization, and overall AV generation quality over strong open‑source baselines.

Abstract:
Text‑to‑Audio‑Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine‑grained joint correctness required by realistic prompts. We introduce AVGen‑Bench, a task‑driven benchmark for T2AV generation featuring high‑quality prompts across 11 real‑world categories. To support comprehensive assessment, we propose a multi‑granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine‑grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio‑visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.

Abstract:
Contrastively pretrained audio‑language models (e.g., CLAP) excel at clip‑level understanding but struggle with frame‑level tasks. Existing extensions fail to exploit the varying granularity of real‑world audio‑text data, where massive clip‑level textual descriptions coexist with limited frame‑level annotations. This paper proposes Fine‑grained Language‑Audio Pretraining (FineLAP), a novel training paradigm that advances both clip‑ and frame‑level alignment in CLAP with heterogeneous data. FineLAP introduces a dual‑stream sigmoid loss with a cluster‑based sampling strategy to jointly learn from clip‑ and frame‑level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self‑supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP‑100k, a large‑scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text‑to‑audio grounding. Ablation studies further show that coarse‑ and fine‑grained alignment are mutually beneficial, providing insights for building better audio‑language models (ALMs).

Abstract:
Recent text‑driven motion generation methods span both discrete token‑based approaches and continuous‑latent formulations. MotionGPT3 exemplifies the latter paradigm, combining a learned continuous motion latent space with a diffusion‑based prior for text‑conditioned synthesis. While rectified flow objectives have recently demonstrated favorable convergence and inference‑time properties relative to diffusion in image and audio generation, it remains unclear whether these advantages transfer cleanly to the motion generation setting. In this work, we conduct a controlled empirical study comparing diffusion and rectified flow objectives within the MotionGPT3 framework. By holding the model architecture, training protocol, and evaluation setup fixed, we isolate the effect of the generative objective on training dynamics, final performance, and inference efficiency. Experiments on the HumanML3D dataset show that rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion‑based motion quality under identical conditions. Moreover, flow‑based priors exhibit stable behavior across a wide range of inference step counts and achieve competitive quality with fewer sampling steps, yielding improved efficiency‑quality trade‑offs. Overall, our results suggest that several known benefits of rectified flow objectives do extend to continuous‑latent text‑to‑motion generation, highlighting the importance of the training objective choice in motion priors.
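
For readers unfamiliar with the objective being compared, a rectified flow trains a network to predict the constant velocity along the straight line between a noise sample and a data sample. The sketch below builds one training pair under that formulation and integrates a velocity field with Euler steps. It assumes the common convention in which t = 0 is noise and t = 1 is data, uses hypothetical function names, and is not drawn from the MotionGPT3 code.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def rectified_flow_pair(noise: Vector, data: Vector, t: float) -> Tuple[List[float], List[float]]:
    """Return (x_t, velocity_target) for the straight path from noise to data."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Linear interpolation between the noise sample and the data sample.
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    # The regression target is the same at every t along a straight path.
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def euler_sample(velocity_fn: Callable[[List[float], float], Vector],
                 noise: Vector, steps: int) -> List[float]:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with `steps` Euler updates."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

If the learned field were exactly the straight-line velocity, a single Euler step would already land on the data sample, which is the intuition behind competitive quality at low sampling step counts.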

Abstract:
Recent Video‑to‑Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high‑quality audio. However, they struggle with fine‑grained temporal control in multi‑event scenarios or when visual cues are insufficient, such as small regions, off‑screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT‑based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script‑Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi‑event scenarios, we further propose Bi‑Frame Sound Synthesis, enabling parallel in‑frame and out‑of‑frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.

Abstract:
Understanding and manipulating timbre is central to audio synthesis, yet this remains under‑explored in machine learning due to a lack of annotated datasets linking perceptual timbre dimensions to semantic descriptors. We present the Semantic Timbre Dataset, a curated collection of monophonic electric guitar sounds, each labeled with one of 19 semantic timbre descriptors and corresponding magnitudes. These descriptors were derived from a qualitative analysis of physical and virtual guitar effect units and applied systematically to clean guitar tones. The dataset bridges perceptual timbre and machine learning representations, supporting learning for timbre control and semantic audio generation. We validate the dataset by training a variational autoencoder (VAE) on its latent space and evaluating it using human perceptual judgments and descriptor classifiers. Results show that the VAE captures timbral structure and enables smooth interpolation across descriptors. We release the dataset, code, and evaluation protocols to support timbre‑aware generative AI research.

Abstract:
Multimodal generative models have shown remarkable progress in single‑modality video and audio synthesis, yet truly joint audio‑video generation remains an open challenge. In this paper, I present four key contributions to advance this field. First, I release two high‑quality, paired audio‑video datasets. The datasets consist of 13 hours of video‑game clips and 64 hours of concert performances, each segmented into consistent 34‑second samples to facilitate reproducible research. Second, I train the MM‑Diffusion architecture from scratch on these datasets, demonstrating its ability to produce semantically coherent audio‑video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder‑decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two‑step text‑to‑audio‑video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high‑fidelity audio‑video generation.

Abstract:
Recent advances in text‑to‑audio generation enable models to translate natural‑language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text‑to‑audio systems under controlled prompt perturbations. We selected MusicGen‑small, MusicGen‑large, and Stable Audio 2.5 as representative models, and we evaluated them under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen‑large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These findings indicate that fragility arises primarily during semantic‑to‑acoustic realization rather than multi‑modal embedding alignment. Our study introduces a controlled framework for evaluating robustness in text‑to‑audio generation and highlights the need for multi‑level stability assessment in generative audio systems.
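
The semantic consistency figures reported above are cosine similarities between embeddings of outputs generated from paired prompts. A minimal sketch of that comparison follows. The function name is hypothetical, and which embedding model the study uses is not assumed here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the measure ignores vector magnitude and operates only in embedding space, two clips can score highly here while still differing in spectral or temporal detail, which is the gap the abstract describes.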

Abstract:
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text‑to‑audio (TTA) generation remains largely under‑explored. Prior work typically employs offline methods like Direct Preference Optimization (DPO) and leverages Contrastive Language‑Audio Pretraining (CLAP) models as reward functions. In this study, we investigate the integration of online Group Relative Policy Optimization (GRPO) into TTA generation. We adapt the algorithm for Flow Matching‑based audio models and demonstrate that online RL significantly outperforms its offline counterparts. Furthermore, we incorporate rewards derived from Large Audio Language Models (LALMs), which can provide fine‑grained scoring signals that are better aligned with human perception. With only 470M parameters, our final model, Resonate, establishes a new SOTA on TTA‑Bench in terms of both audio quality and semantic alignment.
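
The group-relative part of GRPO can be illustrated compactly: several outputs are sampled for the same prompt, each is scored by a reward model, and each reward is standardized against its own group, so no learned value function is needed. The sketch below shows only this advantage computation. The function name and the small `eps` constant are illustrative choices, and the adaptation to flow-matching models described in the abstract is not reproduced.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one prompt's group of sampled outputs."""
    if len(rewards) < 2:
        raise ValueError("a group needs at least two sampled outputs")
    mean = statistics.fmean(rewards)
    # Population standard deviation over the group; some implementations
    # use the sample standard deviation instead.
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every output receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

Outputs that score above their group's mean receive positive advantages and are reinforced, while a group with identical rewards contributes no learning signal.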

Abstract:
This paper introduces V2A‑DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow‑based video‑to‑audio generation (V2A) models, incorporating key adaptations to effectively align generated audio with human preferences. Our approach incorporates three core innovations: (1) AudioScore, a comprehensive human preference‑aligned scoring system for assessing semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore‑driven pipeline for generating large‑scale preference pair data for DPO optimization; (3) a curriculum learning‑empowered DPO optimization strategy specifically tailored for flow‑based generative models. Experiments on the benchmark VGGSound dataset demonstrate that human‑preference‑aligned Frieren and MMAudio using V2A‑DPO outperform their counterparts optimized using Denoising Diffusion Policy Optimization (DDPO) as well as pre‑trained baselines. Furthermore, our DPO‑optimized MMAudio achieves state‑of‑the‑art performance across multiple metrics, surpassing published V2A models.

Abstract:
Coordinated audio generation from video inputs typically requires strict audio‑visual (AV) alignment, where both the semantics and the rhythm of the generated audio segments must correspond to those of the video frames. Previous studies use a two‑stage design in which the AV encoders are first aligned via contrastive learning, and the encoded video representations then guide the audio generation process. We observe that contrastive learning and global video guidance are both effective at aligning overall AV semantics but limit temporal rhythmic synchronization. In this work, we propose FoleyFlow, which first aligns unimodal AV encoders via masked modeling training, where masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders, which are separately pretrained using only unimodal data, are aligned with semantic and rhythmic consistency. We then develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow uses temporally varying video features as the dynamic condition to guide the generation of the corresponding audio segments. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment and use these representations of video segments to guide audio generation temporally. Our audio results are evaluated on standard benchmarks and substantially surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective at generating coordinated audio that is both semantically and rhythmically coherent with various video sequences.

Abstract:
Text‑to‑audio diffusion models produce high‑fidelity audio but require tens of function evaluations (NFEs), incurring multi‑second latency and limited throughput. We present SoundWeaver, the first training‑free, model‑agnostic serving system that accelerates text‑to‑audio diffusion by warm‑starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration‑aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality‑aware eviction and refinement. On real‑world audio traces, SoundWeaver achieves 1.8‑3.0× latency reduction with a cache of only ~1K entries while preserving or improving perceptual quality.

Abstract:
Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self‑Flow: a self‑supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual‑Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi‑modal training while following expected scaling laws, achieving superior image, video, and audio generation.

Abstract:
Advances in multi‑modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real‑time multi‑modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real‑time podcast video generation, integrating LLMs, text‑to‑speech, and video‑audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource‑aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade‑offs between latency, cost, and quality. The cheapest setup generates a 10‑minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than real time) for less than $25. StreamWise enables high‑quality real‑time streaming with a sub‑second startup delay for under $45.

Abstract:
Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top‑performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.

Abstract:
Diffusion models are typically trained using pointwise reconstruction objectives that are agnostic to the spectral and multi‑scale structure of natural signals. We propose a loss‑level spectral regularization framework that augments standard diffusion training with differentiable Fourier‑ and wavelet‑domain losses, without modifying the diffusion process, model architecture, or sampling procedure. The proposed regularizers act as soft inductive biases that encourage appropriate frequency balance and coherent multi‑scale structure in generated samples. Our approach is compatible with DDPM, DDIM, and EDM formulations and introduces negligible computational overhead. Experiments on image and audio generation demonstrate consistent improvements in sample quality, with the largest gains observed on higher‑resolution, unconditional datasets where fine‑scale structure is most challenging to model.
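
A loss-level spectral regularizer of this kind can be sketched as a pointwise error plus a weighted error between magnitude spectra. This is a plain numeric illustration under assumed choices (a naive DFT, mean squared error on magnitudes, and a hypothetical weight of 0.1); the paper's actual Fourier and wavelet losses and their weighting are not given in the abstract, and a training implementation would use a differentiable framework instead of pure Python.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitude of each DFT bin, computed with the O(n^2) definition."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(signal)))
        for k in range(n)
    ]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def regularized_loss(pred, target, weight=0.1):
    """Pointwise MSE plus a weighted Fourier-magnitude MSE."""
    pointwise = mse(pred, target)
    spectral = mse(dft_magnitudes(pred), dft_magnitudes(target))
    return pointwise + weight * spectral
```

Because the extra term only changes the loss value, the diffusion process, architecture, and sampler are untouched, which matches the "loss-level" framing in the abstract.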

Abstract:
Text‑to‑audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text‑audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)‑based architectures. Our method first employs a large language model (LLM) to generate high‑fidelity, richly detailed audio captions, substantially improving text‑audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced reinforcement learning algorithm, to fine‑tune the T2A model. Through systematic experimentation with diverse reward functions (including CLAP, KL, FAD, and their combinations), we identify the key drivers of effective RL in audio synthesis and analyze how reward design impacts final audio quality. Experimental results demonstrate that GRPO‑based fine‑tuning yields substantial gains in synthesis fidelity and prompt adherence.
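
The group-relative part of GRPO can be shown in a few lines: rewards for several samples generated from the same prompt are normalized by that group's mean and standard deviation. The sketch covers only this advantage computation, with a hypothetical function name and epsilon; how this paper combines CLAP, KL, and FAD into a scalar reward is not specified in the abstract and is not modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-sample rewards within one prompt's group.

    A sample's advantage is positive when it beats the group average and
    negative otherwise; eps guards against a zero standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Since the baseline is the group mean, no separate value network is needed, which is the main practical appeal of the approach.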

Abstract:
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training‑free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence‑adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video‑text benchmarks, including adversarial and out‑of‑distribution settings. We also demonstrate UMPIRE's generalization to non‑text output tasks, including image and audio generation.

Abstract:
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text‑to‑audio (TTA), text‑to‑music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)‑based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general‑purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE‑based architectures are a prerequisite for audio synthesis. Checkpoints are available at https://huggingface.co/mispeech/dashengtokenizer.

Authors: Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, Yikun Dou, Zheng Chen, Mingyuan Fan, Tuanhui Li, Mingshan Chang, Hao Zhang, Xiaopeng Sun, Jingtao Xu, Yuqiang Xie, Jiahua Wang, Zhiheng Xu, Weiming Xiong, Yuzhe Jin, Baoxuan Gu, Binjie Mao, Yunjie Yu, Jujie He, Yuhao Feng, Shiwen Tu, Chaojie Wang, Rui Yan, Wei Shen, Jingchen Wu, Peng Zhao, Xuanyue Zhong, Zhuangzhuang Liu, Kaifei Wang, Fuxiang Zhang, Weikai Xu, Wenyan Liu, Binglu Zhang, Yu Shen, Tianhui Xiong, Bin Peng, Liang Zeng, Xuchen Song, Haoxiang Guo, Peiyu Wang, Max W. Y. Lam, Chien-Hung Liu, Yahui Zhou

Abstract:
SkyReels V4 is a unified multi‑modal video foundation model for joint video‑audio generation, inpainting, and editing. The model adopts a dual‑stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on Multimodal Large Language Models (MLLM). SkyReels V4 accepts rich multi‑modal instructions, including text, images, video clips, masks, and audio references. By combining the MLLM's multi‑modal instruction‑following capability with in‑context learning in the video‑branch MMDiT, the model can inject fine‑grained visual guidance under complex conditioning, while the audio‑branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel‑concatenation formulation that unifies a wide range of inpainting‑style tasks, such as image‑to‑video, video extension, and video editing, under a single interface, and naturally extends to vision‑referenced inpainting and editing via multi‑modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15‑second duration, enabling high‑fidelity, multi‑shot, cinema‑level video generation with synchronized audio. To make such high‑resolution, long‑duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low‑resolution full sequences and high‑resolution keyframes, followed by dedicated super‑resolution and frame interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi‑modal input, joint video‑audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.

Abstract:
Diffusion Transformers have achieved state‑of‑the‑art performance in class‑conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class‑conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet‑1K, while continuous‑condition tasks such as pose‑guided image generation and video‑to‑audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low‑magnitude dimensions‑‑removing up to two‑thirds of the embedding space‑‑we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer‑based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.
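
The two measurements mentioned here, angular similarity and magnitude-based pruning, can be sketched on toy vectors. The per-vector pruning rule below (keep the largest-magnitude third of dimensions, zero the rest) and the function names are assumptions for illustration; the paper may select dimensions differently, for example using statistics across all conditions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prune_low_magnitude(embedding, keep_frac=1 / 3):
    """Zero out all but the largest-magnitude dimensions of an embedding."""
    k = max(1, round(len(embedding) * keep_frac))
    order = sorted(range(len(embedding)),
                   key=lambda i: abs(embedding[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(embedding)]
```

When a few head dimensions dominate, the pruned vector stays almost parallel to the original, which is consistent with the abstract's observation that removing tail dimensions leaves generation largely unaffected.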

Abstract:
Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame‑level video information. In this work, we tackle the scaling challenge in multimodal‑to‑audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To this end, we present a multimodal hierarchical network, MMHNet, an enhanced extension of state‑of‑the‑art video‑to‑audio models. Our approach integrates a hierarchical method and non‑causal Mamba to support long‑form audio generation. Our proposed method significantly improves long audio generation, up to more than 5 minutes. We also show that training on short clips and testing on long ones is possible in video‑to‑audio generation without training on the longer durations. Our experiments show that the proposed method achieves remarkable results on long‑video‑to‑audio benchmarks, outperforming prior work in video‑to‑audio tasks. Moreover, we showcase our model's capability to generate more than 5 minutes of audio, whereas prior video‑to‑audio methods fall short at long durations.

Abstract:
Current audio language models are predominantly text‑first, either extending pre‑trained text LLM backbones or relying on semantic‑only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next‑token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross‑modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices ‑‑ data sources, text mixture ratios, and token composition ‑‑ establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning 3×10^18 to 3×10^20 FLOPs, finding that optimal data grows 1.6× faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks ‑‑ we demonstrate this by fine‑tuning for voice‑preserving speech‑to‑speech translation, using the same unified architecture.

Abstract:
We study the fine‑grained text‑to‑audio (T2A) generation task. While recent models can synthesize high‑quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for specific control types, we propose to train ControlNet models on top of pre‑trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A‑ControlNet and T2A‑Adapter, and show that the T2A‑Adapter model offers a more efficient structure with strong control ability. With only 38M additional parameters, T2A‑Adapter achieves state‑of‑the‑art performance on the AudioSet‑Strong in both event‑level and segment‑level F1 scores. We further extend this framework to audio editing, proposing T2A‑Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.

Abstract:
Recent advances in end‑to‑end trained omni‑models have significantly improved multimodal understanding. At the same time, safety red‑teaming has expanded beyond text to encompass audio‑based jailbreak attacks. However, an important bridge between textual and audio jailbreaks remains underexplored. In this work, we study the cross‑modality transfer of jailbreak attacks from text to audio, motivated by the semantic similarity between the two modalities and the maturity of textual jailbreak methods. We first analyze the connection between modality alignment and cross‑modality jailbreak transfer, showing that strong alignment can inadvertently propagate textual vulnerabilities to the audio modality, which we term the alignment curse. Guided by this analysis, we conduct an empirical evaluation of textual jailbreaks, text‑transferred audio jailbreaks, and existing audio‑based jailbreaks on recent omni‑models. Our results show that text‑transferred audio jailbreaks perform comparably to, and often better than, audio‑based jailbreaks, establishing them as simple yet powerful baselines for future audio red‑teaming. We further demonstrate strong cross‑model transferability and show that text‑transferred audio attacks remain effective even under a stricter audio‑only access threat model.

Abstract:
Recent advances in text‑to‑audio‑video (T2AV) generation have enabled models to synthesize audio‑visual videos with multi‑participant dialogues. However, existing evaluation benchmarks remain largely designed for human‑recorded videos or single‑speaker settings. As a result, structural failures in generated multi‑talker dialogue videos, such as identity drift, unnatural turn transitions, and audio‑visual misalignment, cannot be effectively diagnosed. To address this issue, we introduce MTAVG‑Bench, a failure‑driven diagnostic benchmark for multi‑talker dialogue‑centric audio‑video generation. MTAVG‑Bench is built via a semi‑automatic pipeline, where 1.8k videos are generated using mainstream T2AV models with carefully designed prompts, yielding 2.4k manually annotated QA pairs for fine‑grained failure diagnosis. The benchmark evaluates multi‑speaker dialogue generation at four levels: audio‑visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. Built on a hierarchical failure taxonomy and a targeted QA protocol, MTAVG‑Bench is primarily designed to evaluate whether proprietary and open‑source omni‑models can reliably identify failure modes in multi‑speaker T2AV outputs. We benchmark 12 proprietary and open‑source omni‑models on MTAVG‑Bench, with Gemini 3 Pro achieving the strongest overall performance, while leading open‑source models remain competitive in signal fidelity and consistency. Overall, MTAVG‑Bench enables fine‑grained failure analysis for rigorous model comparison and targeted video generation refinement.

Abstract:
Large audio‑language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text‑to‑audio jailbreak that embeds disallowed directives within a narrative‑style audio stream. The attack leverages an advanced instruction‑following text‑to‑speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state‑of‑the‑art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text‑only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech‑based interfaces become more prevalent.

Abstract:
Large‑scale biodiversity monitoring platforms increasingly rely on multimodal wildlife observations. While recent foundation models enable rich semantic representations across vision, audio, and language, retrieving relevant observations from massive archives remains challenging due to the computational cost of high‑dimensional similarity search. In this work, we introduce compact hypercube embeddings for fast text‑based wildlife observation retrieval, a framework that enables efficient text‑based search over large‑scale wildlife image and audio databases using compact binary representations. Building on the cross‑view code alignment hashing framework, we extend lightweight hashing beyond a single‑modality setup to align natural language descriptions with visual or acoustic observations in a shared Hamming space. Our approach leverages pretrained wildlife foundation models, including BioCLIP and BioLingual, and adapts them efficiently for hashing using parameter‑efficient fine‑tuning. We evaluate our method on large‑scale benchmarks, including iNaturalist2024 for text‑to‑image retrieval and iNatSounds2024 for text‑to‑audio retrieval, as well as multiple soundscape datasets to assess robustness under domain shift. Results show that retrieval using discrete hypercube embeddings achieves competitive, and in several cases superior, performance compared to continuous embeddings, while drastically reducing memory and search cost. Moreover, we observe that the hashing objective consistently improves the underlying encoder representations, leading to stronger retrieval and zero‑shot generalization. These results demonstrate that binary, language‑based retrieval enables scalable and efficient search over large wildlife archives for biodiversity monitoring systems.
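
Retrieval in a shared Hamming space can be illustrated as follows: continuous embeddings are reduced to sign bits, packed into integers, and compared with XOR and a popcount. The sign-threshold binarization, the function names, and the toy database are assumptions for illustration; the paper's learned hashing objective is not reproduced.

```python
def binarize(vec):
    """Pack sign bits into an int: bit i is set when vec[i] > 0."""
    code = 0
    for i, v in enumerate(vec):
        if v > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two packed binary codes."""
    return bin(a ^ b).count("1")

def search(query_vec, database, top_k=1):
    """Rank database items (name -> packed code) by Hamming distance."""
    q = binarize(query_vec)
    ranked = sorted(database.items(), key=lambda kv: hamming(q, kv[1]))
    return [name for name, _ in ranked[:top_k]]
```

A 64-dimensional code stored this way takes 8 bytes instead of 256 bytes for float32 values, which hints at where the memory and search savings reported in the abstract come from.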

Abstract:
In recent years, Text‑to‑Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high‑level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine‑grained acoustic details. SemanticAudio employs a two‑stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer subsequently produces high‑fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training‑free text‑guided editing mechanism that enables precise attribute‑level modifications on general audio without retraining. Specifically, this is achieved by steering the semantic generation trajectory via the difference of velocity fields derived from source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in semantic alignment. Demo available at: https://semanticaudio1.github.io/

Abstract:
Although text‑to‑audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely‑adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain limited in fine‑grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone‑agnostic evaluation framework that leverages the reasoning capabilities of audio‑aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open‑ended text generation, it estimates alignment by computing the exact log‑probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human‑rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity‑based metrics and generative prompting baselines, showing its effectiveness in capturing subtle semantic inconsistencies and scaling with the capability of underlying ALLMs.
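
Scoring by the log-probability of a "Yes" answer can be sketched as a log-softmax over next-token logits. The logits dictionary, the token strings, and the function name are hypothetical; obtaining real logits from an audio-aware LLM, and the exact query format AQAScore uses, are outside this sketch.

```python
import math

def log_prob_of_token(logits, token="Yes"):
    """Log-probability of one token under a softmax over all logits.

    Uses the max-subtraction trick so large logits do not overflow.
    """
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return logits[token] - log_z
```

Reading the score directly from the distribution avoids parsing free-form generated text, which is the contrast the abstract draws with open-ended generation.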

Abstract:
Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound‑object category, revealing a form of implicit planning. Building on this insight, we propose Plan‑Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)‑inspired objective to predict final instruction‑following quality from partial generations. At inference time, Plan‑Critic enables guided exploration: it evaluates candidate prefixes early, prunes low‑fidelity trajectories, and reallocates computation to high‑potential planning seeds. Our Plan‑Critic‑guided sampling achieves up to a 10‑point improvement in CLAP score over the AR baseline, establishing a new state of the art in AR text‑to‑audio generation, while maintaining computational parity with standard best‑of‑N decoding. This work bridges the gap between causal generation and global semantic alignment, demonstrating that even strictly autoregressive models can plan ahead.
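
The evaluate-prune-reallocate loop described here can be sketched with placeholder callables. All four callables (`sample_prefix`, `complete`, `critic`, `score_final`), the parameter names, and the budget split are hypothetical stand-ins; the actual Plan-Critic model, its training objective, and its compute accounting are not reproduced.

```python
def guided_sample(sample_prefix, complete, critic, score_final,
                  n_prefixes=8, keep=2, completions_per_prefix=4):
    """Sample short prefixes, keep the ones the critic rates highest,
    and spend the completion budget only on those survivors."""
    prefixes = [sample_prefix() for _ in range(n_prefixes)]
    # Prune: rank prefixes by the critic's predicted final quality.
    survivors = sorted(prefixes, key=critic, reverse=True)[:keep]
    # Reallocate: several full completions per surviving prefix.
    candidates = [complete(p)
                  for p in survivors
                  for _ in range(completions_per_prefix)]
    return max(candidates, key=score_final)
```

With the default values the sketch produces eight full completions, the same number a best-of-8 baseline would, but concentrates them on prefixes the critic considers promising.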

Abstract:
In the field of audio generation, signal‑to‑noise ratio (SNR) has long served as an objective metric for evaluating audio quality. Nevertheless, recent studies have shown that SNR and its variants are not always highly correlated with human perception, prompting us to raise the questions: Why does SNR fail in measuring audio quality? And how to improve its reliability as an objective metric? In this paper, we identify the inadequate measurement of phase distance as a pivotal factor and propose to reformulate SNR with specially designed phase‑distance terms, yielding an improved metric named GOMPSNR. We further extend the newly proposed formulation to derive two novel categories of loss function, corresponding to magnitude‑guided phase refinement and joint magnitude‑phase optimization, respectively. Besides, extensive experiments are conducted for an optimal combination of different loss functions. Experimental results on advanced neural vocoders demonstrate that our proposed GOMPSNR exhibits more reliable error measurement than SNR. Meanwhile, our proposed loss functions yield substantial improvements in model performance, and our well‑chosen combination of different loss functions further optimizes the overall model capability.
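
For reference, the conventional time-domain SNR that this abstract takes as its starting point can be computed as below. This is only the standard baseline definition in decibels; the phase-distance terms that define GOMPSNR are not given in the abstract and are not reproduced, and the function name is an illustrative choice.

```python
import math

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference and an estimate.

    SNR = 10 * log10(sum(s^2) / sum((s - s_hat)^2)); returns infinity
    when the estimate matches the reference exactly.
    """
    signal = sum(s * s for s in reference)
    noise = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)
```
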

Authors: Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman

Abstract:
Recent text‑to‑video diffusion models can generate compelling video sequences, yet they remain silent ‑‑ missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX‑2, an open‑source foundational model capable of generating high‑quality, temporally synchronized audiovisual content in a unified manner. LTX‑2 consists of an asymmetric dual‑stream transformer with a 14B‑parameter video stream and a 5B‑parameter audio stream, coupled through bidirectional audio‑video cross‑attention layers with temporal positional embeddings and cross‑modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality‑aware classifier‑free guidance (modality‑CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX‑2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene ‑‑ complete with natural background and foley elements. In our evaluations, the model achieves state‑of‑the‑art audiovisual quality and prompt adherence among open‑source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.

Abstract:
Training a unified model integrating video‑to‑audio (V2A), text‑to‑audio (T2A), and joint video‑text‑to‑audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high‑quality audio captions with tight V‑A‑T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross‑task and intra‑task competition, manifesting as an adverse V2A‑T2A performance trade‑off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large‑scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision‑to‑Language Compression to mitigate visual bias of MLLMs, a Junior‑Senior Agent Handoff for a 5× cost reduction, and rigorous Post‑hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V‑A‑T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross‑task and intra‑task competition, we design a three‑stage multi‑task progressive training schedule that converts cross‑task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio‑visual alignment and off‑screen audio generation faithfulness. Finally, we construct VGGSound‑Omni, a comprehensive benchmark for unified evaluation, including challenging off‑screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions.

Abstract:
Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video‑text‑to‑audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine‑grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video‑grounded sound generation with both event‑level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine‑grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley‑6k, a large‑scale, expert‑curated benchmark containing over 6,000 video‑instruction‑annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding‑event‑centric agentic generation framework with a slow‑fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.

Abstract:
Recent progress in audio generation models has made it possible to create highly realistic and immersive soundscapes, which are now widely used in film and virtual‑reality‑related applications. However, these audio generators also raise concerns about potential misuse, such as producing deceptive audio for fabricated videos or spreading misleading information. Therefore, it is essential to develop effective methods for detecting fake environmental sounds. Existing datasets for environmental sound deepfake detection (ESDD) remain limited in both scale and the diversity of sound categories they cover. To address this gap, we introduced EnvSDD, the first large‑scale curated dataset designed for ESDD. Based on EnvSDD, we launched the ESDD Challenge, recognized as one of the ICASSP 2026 Grand Challenges. This paper presents an overview of the ESDD Challenge, including a detailed analysis of the challenge results.

Abstract:
Text‑to‑audio (TTA) generation can significantly benefit the media industry by reducing production costs and enhancing work efficiency. However, most current TTA models (primarily diffusion‑based) suffer from slow inference speeds and high computational costs. In this paper, we introduce AudioGAN, the first successful Generative Adversarial Networks (GANs)‑based TTA framework that generates audio in a single pass, thereby reducing model complexity and inference time. To overcome the inherent difficulties in training GANs, we integrate multiple contrastive losses and propose two innovative components, Single‑Double‑Triple (SDT) Attention and Time‑Frequency Cross‑Attention (TF‑CA). Extensive experiments on the AudioCaps dataset demonstrate that AudioGAN achieves state‑of‑the‑art performance while using 90% fewer parameters and running 20 times faster, synthesizing audio in under one second. These results establish AudioGAN as a practical and powerful solution for real‑time TTA.

Abstract:
Text‑to‑Audio‑Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross‑modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV‑Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy‑driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV‑Compass introduces a dual‑level evaluation framework that integrates objective signal‑level metrics for video quality, audio quality, and cross‑modal alignment with a subjective MLLM‑as‑a‑Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human‑level realism and cross‑modal consistency, with persistent failures in audio realism, fine‑grained synchronization, and instruction following, among others. These results indicate significant room for improvement in future models and highlight the value of T2AV‑Compass as a challenging and diagnostic testbed for advancing text‑to‑audio‑video generation.

Abstract:
Many existing audio processing and generation models rely on task‑specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework capable of handling multiple tasks, while providing robust instruction and audio understanding and high‑quality audio generation. This requires a compatible paradigm design, a powerful backbone, and a high‑fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder‑only autoregressive (AR) LM‑based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H‑Codec, which incorporates self‑supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H‑Codec, such as a dynamic frame‑rate mechanism and extending the audio sampling rate to 48 kHz. QuarkAudio unifies tasks by using task‑specific conditional information as the conditioning sequence of the decoder‑only LM, and predicting discrete target audio tokens in an AR manner. The framework supports a wide range of audio processing and generation tasks, including speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language‑queried audio source separation (LASS). In addition, we extend downstream tasks to universal free‑form audio editing guided by natural language instructions (including speech semantic editing and audio event editing). Experimental results show that H‑Codec achieves high‑quality audio reconstruction with a low frame rate, improving both the efficiency and performance of downstream audio generation, and that QuarkAudio delivers competitive or comparable performance to state‑of‑the‑art task‑specific or multi‑task systems across multiple tasks.

Abstract:
Recent advances in audio generation have increased the risk of realistic environmental sound manipulation, motivating the ESDD 2026 Challenge as the first large‑scale benchmark for Environmental Sound Deepfake Detection (ESDD). We propose BEAT2AASIST, which extends BEATs‑AASIST by splitting BEATs‑derived representations along the frequency or channel dimension and processing them with dual AASIST branches. To enrich feature representations, we incorporate top‑k transformer layer fusion using concatenation, CNN‑gated, and SE‑gated strategies. In addition, vocoder‑based data augmentation is applied to improve robustness against unseen spoofing methods. Experimental results on the official test sets demonstrate that the proposed approach achieves competitive performance across the challenge tracks.

Abstract:
Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio‑video generation, especially for models aiming to generate synchronized audio‑video outputs. To address this gap, we introduce VABench, a comprehensive and multi‑dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio‑video generation. VABench encompasses three primary task types: text‑to‑audio‑video (T2AV), image‑to‑audio‑video (I2AV), and stereo audio‑video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text‑video, text‑audio, video‑audio), audio‑video synchronization, lip‑speech consistency, and carefully curated audio and video question‑answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.

Abstract:
We present SDialog, an MIT‑licensed open‑source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end‑to‑end framework for building and analyzing LLM‑based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona‑driven multi‑agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM‑as‑a‑judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed‑backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog‑centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

Abstract:
Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines immersive experience and restricts practical applications of these technologies. To address this challenge, several pioneering works have explored diffusion transformer architectures for generating plausible video‑synchronized audio, including Kling‑foley, HunyuanVideo‑foley and Thinksound. Distinct from existing works, we introduce an autoregressive audio generation architecture (DreamFoley) that harnesses the capabilities of large vision‑language models (VLMs) to jointly model sequential interactions among video, audio, and text modalities. Our approach features a dual‑visual encoder module that effectively captures both audio‑aligned and text‑aligned visual features. Additionally, we employ a Residual Vector Quantization audio tokenizer with a delay‑pattern generation scheme to balance the trade‑off between training efficiency and audio quality. Moreover, we introduce the classifier‑free guidance strategy into VLMs to bootstrap generated audio quality. Furthermore, we establish an efficient data production pipeline to scale audio‑video‑text triple collection. Finally, extensive experiments are conducted to validate the effectiveness of our model, achieving promising performance across popular benchmarks. We hope that the findings in this study provide a strong foundation for future video‑to‑audio generation research. We also release the previously missing audio‑visual textual descriptions from the public benchmark, aiming to facilitate subsequent researchers in conducting more convenient and effective evaluations and comparisons.
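
Illustrative note on the delay‑pattern scheme mentioned in the abstract above: the following is a minimal sketch, assuming the commonly used formulation in which codebook k of a K‑level Residual Vector Quantization token grid is shifted right by k steps, so that all levels can be predicted within a single autoregressive pass. The abstract does not state DreamFoley's exact pattern; the function names and the PAD value are hypothetical.

```python
PAD = -1  # hypothetical padding token id (an assumption for this sketch)


def apply_delay(codes, pad=PAD):
    """Shift codebook k right by k steps.

    codes is a list of K rows (one per codebook), each of length T.
    Every returned row has length T + K - 1.
    """
    num_books = len(codes)
    delayed = []
    for k, row in enumerate(codes):
        # k pads in front and (K - 1 - k) pads behind keep all rows equally long.
        delayed.append([pad] * k + list(row) + [pad] * (num_books - 1 - k))
    return delayed


def undo_delay(delayed):
    """Invert apply_delay and recover the original K x T grid."""
    num_books = len(delayed)
    steps = len(delayed[0]) - (num_books - 1)
    return [row[k:k + steps] for k, row in enumerate(delayed)]


codes = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
delayed = apply_delay(codes)
restored = undo_delay(delayed)
```

With this layout, the step that emits the first codebook's token for frame t also emits the second codebook's token for frame t - 1, which is how such schemes trade sequence length against the number of decoding passes.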

Abstract:
We propose MAViD, a novel Multimodal framework for Audio‑Visual Dialogue understanding and generation. Existing approaches primarily focus on non‑interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio‑video fusion. To solve these problems, we propose a Conductor‑Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine‑grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high‑quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long‑duration audio‑visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long‑duration dialogue interactions and accurately interpret users' multimodal queries.

Abstract:
This work introduces a new task, text‑conditioned selective video‑to‑audio (V2A) generation, which produces only the user‑intended sound from a multi‑object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. We propose SELVA, a novel text‑conditioned V2A model that treats the text prompt as an explicit selector to distinctly extract prompt‑relevant sound‑source visual features from the video encoder. To suppress text‑irrelevant activations with efficient video encoder finetuning, the proposed supplementary tokens promote cross‑attention to yield robust semantic and temporal grounding. SELVA further employs an autonomous video‑mixing scheme in a self‑supervised manner to overcome the lack of mono audio track supervision. We evaluate SELVA on VGG‑MONOAUDIO, a curated benchmark of clean single‑source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization.

Abstract:
With the rise of AI‑generated content (AIGC), generating perceptually natural and feeling‑aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling‑aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling‑aligned image‑music‑text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross‑modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel‑spectrograms with a frequency‑weighted L1 loss to enhance high‑frequency fidelity. In the second stage, a fine‑tuned HiFi‑GAN vocoder reconstructs high‑quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel‑Cepstral Distortion, Frechet Audio Distance, Log‑Spectral Distance, and cosine similarity. A small LLM‑based rating study further verifies consistent cross‑modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50k training samples, providing a scalable solution for feeling‑aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.
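
Illustrative note on the frequency‑weighted L1 loss mentioned in the abstract above: the following is a minimal sketch, assuming a linear ramp of weights from the lowest to the highest Mel bin. The abstract does not give Art2Music's actual weighting; the function name and the `max_weight` parameter are hypothetical.

```python
def frequency_weighted_l1(pred, target, max_weight=2.0):
    """Mean absolute error over a Mel-spectrogram, weighting higher bins more.

    pred and target are lists of frames; each frame is a list of n_mels values.
    Weights ramp linearly from 1.0 (lowest bin) to max_weight (highest bin);
    this ramp is an assumption, not the paper's scheme.
    """
    n_mels = len(target[0])
    span = max(n_mels - 1, 1)  # avoid division by zero for a single bin
    weights = [1.0 + (max_weight - 1.0) * m / span for m in range(n_mels)]
    total = 0.0
    for p_frame, t_frame in zip(pred, target):
        for w, p, t in zip(weights, p_frame, t_frame):
            total += w * abs(p - t)
    return total / (len(target) * n_mels)


target = [[0.0, 0.0, 0.0]]
low_bin_error = frequency_weighted_l1([[1.0, 0.0, 0.0]], target)
high_bin_error = frequency_weighted_l1([[0.0, 0.0, 1.0]], target)
```

The same unit error costs more in the top bin than in the bottom bin, which is the sense in which such a loss emphasizes high‑frequency fidelity.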

Abstract:
The synthesis of synchronized audio‑visual content is a key challenge in generative AI, with open‑source models facing challenges in robust audio‑video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine‑grained temporal cues; and (3) the intra‑modal bias of conventional Classifier‑Free Guidance (CFG), which enhances conditionality but not cross‑modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio‑visual synchronization. We first propose a Cross‑Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio‑driven video and video‑driven audio generation tasks. Then, we design a Global‑Local Decoupled Interaction Module for efficient and precise temporal‑style alignment. Finally, we present a novel Synchronization‑Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state‑of‑the‑art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine‑grained audio‑visual synchronization.
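
Illustrative note on the conventional Classifier‑Free Guidance that the abstract above contrasts with SyncCFG: the following is a minimal sketch of the standard CFG combination only. The abstract does not give the SyncCFG formulation, so it is not reproduced here; the function name and the example vectors are hypothetical.

```python
def classifier_free_guidance(uncond, cond, scale):
    """Conventional CFG: move from the unconditional prediction toward the
    conditional one, extrapolating when scale > 1."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0, 2.0]  # hypothetical unconditional model output
cond = [1.0, 1.0, 0.0]    # hypothetical conditional model output
guided = classifier_free_guidance(uncond, cond, scale=3.0)
```

A scale of 1.0 returns the conditional prediction unchanged; larger scales amplify the difference between the two predictions.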

Abstract:
Sound effect editing (modifying audio by adding, removing, or replacing elements) remains constrained by existing approaches that rely solely on low‑level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV‑Edit, a generative sound effect editing framework that enables fine‑grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio‑visual masking autoencoder (CAV‑MAE‑Edit) for multimodal pre‑training, learning aligned cross‑modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM‑DiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation‑based feature gating training strategy. Furthermore, we construct a dedicated video‑based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV‑Edit generates high‑quality audio with precise modifications based on visual content, achieving state‑of‑the‑art performance in the field of sound effect editing and exhibiting strong competitiveness in the domain of audio generation.

Abstract:
Video‑to‑Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio‑visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain‑of‑Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT‑reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast‑GRPO, which employs hybrid ODE‑SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single‑event classes and 501 multi‑event samples. Experimental results demonstrate that PrismAudio achieves state‑of‑the‑art performance across all four perceptual dimensions on both the in‑domain VGGSound test set and out‑of‑domain AudioCanvas benchmark. The project page is available at https://PrismAudio.github.io.
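
Illustrative note on the multidimensional RL optimization described in the abstract above: the following is a minimal sketch, assuming the usual GRPO‑style step of normalizing rewards within a group of samples generated for the same input, and assuming the four per‑dimension rewards are combined by an equal‑weight sum. Neither the reward combination nor Fast‑GRPO's hybrid ODE‑SDE sampling is specified in the abstract; all names and numbers are hypothetical.

```python
import statistics


def combine_rewards(sample_rewards, weights=None):
    """Weighted sum of per-dimension rewards (equal weights by default)."""
    if weights is None:
        weights = {name: 1.0 for name in sample_rewards}
    return sum(weights[name] * value for name, value in sample_rewards.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for three candidate outputs of the same video.
group = [
    {"semantic": 0.2, "temporal": 0.3, "aesthetic": 0.3, "spatial": 0.2},
    {"semantic": 0.5, "temporal": 0.5, "aesthetic": 0.5, "spatial": 0.5},
    {"semantic": 0.8, "temporal": 0.7, "aesthetic": 0.7, "spatial": 0.8},
]
totals = [combine_rewards(sample) for sample in group]
advantages = group_relative_advantages(totals)
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.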

Abstract:
This paper investigates the emerging text‑to‑audio paradigm in artificial intelligence (AI), examining its transformative implications for musical creation, interpretation, and cognition. I explore the complex semantic and semiotic interplays that occur when descriptive natural language prompts are translated into nuanced sound objects across the text‑to‑audio modality. Drawing from structuralist and post‑structuralist perspectives, as well as cognitive theories of schema dynamics and metacognition, the paper explores how these AI systems reconfigure musical signification processes and navigate established cognitive frameworks. The research analyzes some of the cognitive dynamics at play in AI‑mediated musicking, including processes of schema assimilation and accommodation, metacognitive reflection, and constructive perception. The paper argues that text‑to‑audio AI models function as quasi‑objects of musical signification, simultaneously stabilizing and destabilizing conventional forms while fostering new modes of listening and aesthetic reflexivity. Using Udio as a primary case study, this study explores how these models navigate the liminal spaces between linguistic prompts and sonic outputs. This process not only generates novel musical expressions but also prompts listeners to engage in forms of critical and "structurally‑aware listening," encouraging a deeper understanding of music's structures, semiotic nuances, and the socio‑cultural contexts that shape our musical cognition. The paper concludes by reflecting on the potential of text‑to‑audio AI models to serve as epistemic tools and quasi‑objects, facilitating a significant shift in musical interactions and inviting users to develop a more nuanced comprehension of the cognitive and cultural foundations of music.

Abstract:
This paper presents a pedagogical and conceptual account of the course AI in Music and Sound: Modalities, Tools and Creative Applications, offered within the Music Informatics and Media Art module of an M.Sc. in Audio Communication. The course engaged students with a range of AI modalities such as symbolic composition, voice synthesis, timbre transfer, neural audio synthesis, and text‑to‑audio systems, combining theoretical reflection with practice‑based experimentation. Its central pedagogical move is a paired‑études design: each modality is approached first through its intended affordances and then through a deliberately reframed or "misused" exercise that surfaces representational limits and alternative behaviours. Framed by medium theory and post‑structuralist inquiry, we treated AI as a transmodal conduit, a system that translates and perturbs musical signs across textual, symbolic, timbral and audio domains. Evidence from student work and reflection indicates growth in technical fluency, medium awareness, and critical literacy, alongside the cultivation of experimental method and process‑oriented listening. The paper outlines the course architecture, assessment design, and representative projects, and distils a set of design patterns for AI‑music pedagogy (e.g., prompt‑conditioned interplays and semantic destabilisation in text‑to‑audio; latent space materialism in timbre transfer). It concludes with pedagogical recommendations that integrate creative practice with medium awareness and with cultural‑epistemic analysis of AI technologies, preparing students to participate in how AI is understood, developed, and deployed with creative communities.

Abstract:
Text‑to‑audio (TTA) systems are rapidly transforming music creation and distribution, with platforms like Udio and Suno generating thousands of tracks daily and integrating into mainstream music platforms and ecosystems. These systems, trained on vast and largely undisclosed datasets, are fundamentally reshaping how music is produced, reproduced and consumed. This paper presents empirical evidence that artist‑conditioned regions can be systematically microlocated through metatag‑based prompt design, effectively enabling the spawning of artist‑like content through strategic prompt engineering. Through systematic exploration of metatag‑based prompt engineering techniques, this research reveals how users can access the distinctive sonic signatures of specific artists, evidencing their inclusion in training datasets. Using descriptor constellations drawn from public music taxonomies, the paper demonstrates reproducible proximity to artists such as Bon Iver, Philip Glass, Panda Bear and William Basinski. The results indicate stable text‑audio correspondences consistent with artist‑specific training signals, enabling precise traversal of stylistic microlocations without explicitly naming artists. This capacity to summon artist‑specific outputs shows that artists' creative works function as foundational material from which these systems generate new content, often without explicit consent or attribution. Conceptually, the work clarifies how textual descriptors act as navigational cues in high‑dimensional representation spaces; methodologically, it provides a replicable protocol for auditing stylistic inducibility. The findings raise immediate questions for governance (attribution, consent and disclosure standards) and for creative practice, where induced stylistic proximity complicates boundaries between ownership, reproduction, imitation, creative agency and the ethics of algorithmic creation.

Abstract:
Video‑to‑audio generation (V2A) is of increasing importance in domains such as film post‑production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on‑screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley‑style scenarios. We find that 74% of videos from past evaluation datasets have poor audio‑visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large‑scale benchmark explicitly designed for Foley‑style V2A evaluation. FoleyBench contains 5,000 (video, ground‑truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on‑screen events. The dataset is built using an automated, scalable pipeline applied to in‑the‑wild internet videos from YouTube‑based and Vimeo‑based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine‑grained analysis of model performance and failure modes. We benchmark several state‑of‑the‑art V2A models, evaluating them on audio quality, audio‑video alignment, temporal synchronization, and audio‑text consistency. Samples are available at: https://gclef‑cmu.org/foleybench

Abstract:
Vision‑Language‑Action (VLA) models have recently achieved significant advances in robotic manipulation. However, vision‑only VLA models face fundamental limitations, particularly in perceiving the dynamic processes of interaction and manipulation. This paper proposes Audio‑VLA, a multimodal manipulation policy that leverages contact audio to perceive contact events and dynamic process feedback. Audio‑VLA overcomes the vision‑only constraints of VLA models. Additionally, this paper introduces the Task Completion Rate (TCR) metric to systematically evaluate dynamic operational processes. Audio‑VLA employs pre‑trained DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the large language model backbone. We apply LoRA fine‑tuning to these pre‑trained modules to achieve robust cross‑modal understanding of both visual and acoustic inputs. A multimodal projection layer aligns features from different modalities into the same feature space. Moreover, the RLBench and LIBERO simulation environments are enhanced by adding collision‑based audio generation to provide realistic sound feedback during object interactions. Since current robotic manipulation evaluations focus on final outcomes rather than providing systematic assessment of dynamic operational processes, the proposed TCR metric measures how well robots perceive dynamic processes during manipulation, creating a more comprehensive evaluation metric. Extensive experiments on LIBERO, RLBench, and two real‑world tasks demonstrate Audio‑VLA's superior performance over vision‑only comparative methods, while the TCR metric effectively quantifies dynamic process perception capabilities.

Abstract:
We propose a text‑to‑talking‑face synthesis framework leveraging latent speech representations from HierSpeech++. A Text‑to‑Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS‑predicted features, we adopt a two‑stage training scheme: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio‑visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground‑truth audio at inference. Experiments show that conditioning on TTS‑predicted latent features outperforms cascaded pipelines, improving both lip‑sync and visual realism.

Abstract:
We propose a general feedback‑driven retrieval‑augmented generation (RAG) approach that leverages Large Audio Language Models (LALMs) to address the missing or imperfect synthesis of specific sound events in text‑to‑audio (TTA) generation. Unlike previous RAG‑based TTA methods that typically train specialized models from scratch, we utilize LALMs to analyze audio generation outputs, retrieve concepts that pre‑trained models struggle to generate from an external database, and incorporate the retrieved information into the generation process. Experimental results show that our method not only enhances the ability of LALMs to identify missing sound events but also delivers improvements across different models, outperforming existing RAG‑specialized approaches.

Abstract:
Text‑to‑audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most discourse in procedurally generated content (PCG), games that emotionally resonate with players tend to weave together a range of creative and multimodal content (e.g., music, sounds, visuals, narrative tone), and multimodal models have begun seeing at least experimental use for this purpose. However, it remains unclear what exactly such models generate, and with what degree of variability and fidelity: audio is an extremely broad class of output for a generative system to target. Within the PCG community, expressive range analysis (ERA) has been used as a quantitative way to characterize generators' output space, especially for level generators. This paper adapts ERA to text‑to‑audio models, making the analysis tractable by looking at the expressive range of outputs for specific, fixed prompts. Experiments are conducted by prompting the models with several standardized prompts derived from the Environmental Sound Classification (ESC‑50) dataset. The resulting audio is analyzed along key acoustic dimensions (e.g., pitch, loudness, and timbre). More broadly, this paper offers a framework for ERA‑based exploratory evaluation of generative audio models.
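
Illustrative note on the expressive range analysis described in the abstract above: the following is a minimal sketch, assuming each generated clip has already been reduced to a pair of acoustic descriptors, which are then binned into a 2D grid whose occupancy summarizes the output space for one fixed prompt. The descriptor choice, the value ranges, and the data points are made up for illustration and do not come from the paper.

```python
def expressive_range(points, bins, x_range, y_range):
    """Bin (x, y) descriptor pairs into a bins x bins grid of counts."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        i = int((x - x_lo) / (x_hi - x_lo) * bins)
        j = int((y - y_lo) / (y_hi - y_lo) * bins)
        # Clamp so values on or beyond the range edges land in the border cells.
        i = min(max(i, 0), bins - 1)
        j = min(max(j, 0), bins - 1)
        grid[i][j] += 1
    return grid


def coverage(grid):
    """Fraction of grid cells containing at least one output."""
    cells = [count for row in grid for count in row]
    return sum(1 for count in cells if count > 0) / len(cells)


# Made-up (loudness in dB, pitch in Hz) pairs for four outputs of one prompt.
points = [(-30.0, 200.0), (-28.0, 210.0), (-10.0, 800.0), (-12.0, 790.0)]
grid = expressive_range(points, bins=4, x_range=(-40.0, 0.0), y_range=(0.0, 1000.0))
occupied_fraction = coverage(grid)
```

A low occupied fraction for a fixed prompt suggests the generator returns near‑identical audio on each run, while a high fraction suggests wide variability along the chosen descriptors.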

Abstract:
While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human‑interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm‑VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state‑of‑the‑art text‑to‑music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. Although our work addresses only the audio modality, the framework can be extended to interpretable analysis of generative models that operate on visual latent spaces.

Abstract:
The rapid advancement of next‑token‑prediction models has led to widespread adoption across modalities, enabling the creation of realistic synthetic media. In the audio domain, while autoregressive speech models have propelled conversational interactions forward, the potential for misuse, such as impersonation in phishing schemes or crafting misleading speech recordings, has also increased. Security measures such as watermarking have thus become essential to ensuring the authenticity of digital media. Traditional statistical watermarking methods used for autoregressive language models face challenges when applied to autoregressive audio models, due to the inevitable "retokenization mismatch", the discrepancy between original and retokenized discrete audio token sequences. To address this, we introduce Aligned‑IS, a novel, distortion‑free watermark, specifically crafted for audio generation models. This technique utilizes a clustering approach that treats tokens within the same cluster equivalently, effectively countering the retokenization mismatch issue. Our comprehensive testing on prevalent audio generation platforms demonstrates that Aligned‑IS not only preserves the quality of generated audio but also significantly improves the watermark detectability compared to the state‑of‑the‑art distortion‑free watermarking adaptations, establishing a new benchmark in secure audio technology applications.

Abstract:
Text‑to‑audio (TTA) generation is advancing rapidly, but evaluation remains challenging because human listening studies are expensive and existing automatic metrics capture only limited aspects of perceptual quality. We introduce AudioEval, a large‑scale TTA evaluation dataset with 4,200 generated audio samples (11.7 hours) from 24 systems and 126,000 ratings collected from both experts and non‑experts across five dimensions: enjoyment, usefulness, complexity, quality, and text alignment. Using AudioEval, we benchmark diverse automatic evaluators to compare perspective‑ and dimension‑level differences across model families. We also propose Qwen‑DisQA as a strong reference baseline: it jointly processes prompts and generated audio to predict multi‑dimensional ratings for both annotator groups, modeling rater disagreement via distributional prediction and achieving strong performance. We will release AudioEval to support future research in TTA evaluation.

Abstract:
Recent advances in diffusion‑based generative models have enabled high‑quality text‑to‑audio synthesis, but fine‑grained acoustic control remains a significant challenge in open‑source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the Stable Audio Open architecture to address this "control gap" in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time‑varying control signals, loudness, pitch, spectral centroid, and timbre, for precise and interpretable manipulation of acoustic features. The model is efficiently adapted for the nuanced domain of Foley synthesis using Low‑Rank Adaptation (LoRA) on a curated subset of AudioSet, requiring only 0.85% of the original parameters to be trained. Experiments demonstrate that Audio Palette achieves fine‑grained, interpretable control of sound attributes. Crucially, it accomplishes this novel controllability while maintaining high audio quality and strong semantic alignment to text prompts, with performance on standard metrics such as Frechet Audio Distance (FAD) and LAION‑CLAP scores remaining comparable to the original baseline model. We provide a scalable, modular pipeline for audio research, emphasizing sequence‑based conditioning, memory efficiency, and a three‑scale classifier‑free guidance mechanism for nuanced inference‑time control. This work establishes a robust foundation for controllable sound design and performative audio synthesis in open‑source settings, enabling a more artist‑centric workflow in the broader context of music and sound information retrieval.

Abstract:
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text‑to‑audio generation as separate tasks. Very few studies attempt to unify these tasks, an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text‑to‑audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM‑Gen, a text‑to‑audio language model that directly predicts audio tokens and is comparable to state‑of‑the‑art diffusion‑based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state‑of‑the‑art specialized models in audio understanding, text‑to‑audio generation, and text reasoning. Furthermore, we present UALM‑Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross‑modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Abstract:
Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three‑dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large‑scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real‑world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine‑grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text‑to‑speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high‑quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.

Abstract:
Video‑to‑Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off‑screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority‑voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we introduce HALCON to mitigate IH. HALCON follows a three‑stage procedure: it first generates initial audio to expose hallucinated segments, then identifies and masks the corresponding unreliable video features, and finally regenerates the audio using the corrected conditioning. Experiments on several mainstream V2A benchmarks first reveal that state‑of‑the‑art models suffer from severe IH. In contrast, our HALCON method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
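
The two metrics defined above can be illustrated with a minimal sketch. The data layout (a dict with a `duration` and a list of `hallucinated` intervals) and the function name `ih_metrics` are assumptions made for this example, not the authors' code; aggregating IH@dur over the total duration of all videos, rather than averaging per video, is likewise an assumption.

```python
def ih_metrics(videos):
    """Compute (IH@vid, IH@dur) for a list of videos.

    Each video is a dict with a hypothetical layout:
      {"duration": seconds, "hallucinated": [(start, end), ...]}
    Segments are assumed non-overlapping and inside [0, duration].
    """
    if not videos:
        raise ValueError("need at least one video")
    n_with_ih = 0
    total_dur = 0.0
    total_ih = 0.0
    for v in videos:
        # Total hallucinated time in this video.
        ih = sum(end - start for start, end in v["hallucinated"])
        n_with_ih += ih > 0
        total_dur += v["duration"]
        total_ih += ih
    # IH@vid: fraction of videos with any hallucination.
    # IH@dur: fraction of total duration that is hallucinated.
    return n_with_ih / len(videos), total_ih / total_dur


videos = [
    {"duration": 10.0, "hallucinated": [(2.0, 4.0)]},
    {"duration": 10.0, "hallucinated": []},
]
ih_vid, ih_dur = ih_metrics(videos)
```

In this toy input one of two videos contains a hallucinated segment, and 2 of the 20 total seconds are hallucinated.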

Abstract:
In this work, we present FoleyGRAM, a novel approach to video‑to‑audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video‑to‑audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion‑based audio synthesis model conditioned on GRAM‑aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video‑to‑audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video‑to‑audio synthesis.

Abstract:
Although audio generation has been widely studied over recent years, video‑aligned audio generation still remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. Moreover, StereoSync also achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high‑quality synthesis. Unlike existing methods that primarily focus on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video‑aligned audio generation. Indeed, given an input video, our approach extracts spatial cues from depth maps and bounding boxes, using them as cross‑attention conditioning in a diffusion‑based audio generation model. Such an approach allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video‑to‑audio generation and resulting in a significantly more immersive and realistic audio experience.

Abstract:
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text‑to‑audio (T2A) generation, they still lag behind diffusion‑based models by a non‑trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM‑based framework that employs multiple isolated transformers with causal conditioning and anti‑causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM‑based and diffusion‑based T2A systems, achieving state‑of‑the‑art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi‑modal generation frameworks.

Abstract:
This paper presents a novel approach to neural instrument sound synthesis using a two‑stage semi‑supervised learning framework capable of generating pitch‑accurate, high‑quality music samples from an expressive timbre latent space. Existing approaches that achieve sufficient quality for music production often rely on high‑dimensional latent representations that are difficult to navigate and provide unintuitive user experiences. We address this limitation through a two‑stage training paradigm: first, we train a pitch‑timbre disentangled 2D representation of audio samples using a Variational Autoencoder; second, we use this representation as conditioning input for a Transformer‑based generative model. The learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the proposed method effectively learns a disentangled timbre space, enabling expressive and controllable audio generation with reliable pitch conditioning. Experimental results show the model's ability to capture subtle variations in timbre while maintaining a high degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application, highlighting its potential as a step towards future music production environments that are both intuitive and creatively empowering: https://pgesam.faresschulz.com

Abstract:
We propose SALSA‑V, a multimodal video‑to‑audio generation model capable of synthesizing highly synchronized, high‑fidelity long‑form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio‑conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high‑quality audio samples in as few as eight sampling steps, paving the way for near‑real‑time applications without requiring dedicated fine‑tuning or retraining. We demonstrate that SALSA‑V significantly outperforms existing state‑of‑the‑art methods in both audiovisual alignment and synchronization with video content in quantitative evaluation and a human listening study. Furthermore, our use of random masking during training enables our model to match spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.

Abstract:
Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models focus on the entire video and do not provide precise methods for prioritizing a specific object within a scene; as a result, they generate unnecessary background sounds or focus on the wrong objects. To address this gap, we introduce the novel task of video object segmentation‑aware audio generation, which explicitly conditions sound synthesis on object‑level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine‑grained and visually localized control over audio generation. To support this task and further research on segmentation‑aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements over current state‑of‑the‑art methods and sets a new standard for controllable, high‑fidelity Foley synthesis. Code, samples, and Segmented Music Solos are available at https://saganet.notion.site

Abstract:
Research on audio generation has progressively developed along both waveform‑based and spectrogram‑based directions, giving rise to diverse strategies for representing and generating audio. At the same time, advances in image synthesis have shown that autoregression across scales, rather than tokens, improves coherence and detail. Building on these ideas, we introduce MARS (Multi‑channel AutoRegression on Spectrograms), which, to the best of our knowledge, is the first adaptation of next‑scale autoregressive modeling to the spectrogram domain. MARS treats spectrograms as multi‑channel images and employs channel multiplexing (CMX), a reshaping strategy that reduces spatial resolution without information loss. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer‑based autoregressor to refine spectrograms from coarse to fine resolutions efficiently. Experiments on a large‑scale dataset demonstrate that MARS performs comparably or better than state‑of‑the‑art baselines across multiple evaluation metrics, establishing an efficient and scalable paradigm for high‑fidelity sound generation.
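
The abstract does not spell out how channel multiplexing is implemented. One common lossless way to trade spatial resolution for channels is a space‑to‑depth rearrangement, sketched below with NumPy. The function names `channel_multiplex` and `channel_demultiplex`, the factor `r=2`, and the `(C, H, W)` layout are assumptions for illustration and may differ from the authors' CMX.

```python
import numpy as np


def channel_multiplex(spec, r=2):
    """Rearrange (C, H, W) -> (C*r*r, H//r, W//r) without discarding values."""
    c, h, w = spec.shape
    if h % r or w % r:
        raise ValueError("H and W must be divisible by r")
    x = spec.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)           # (C, r, r, H//r, W//r)
    return x.reshape(c * r * r, h // r, w // r)


def channel_demultiplex(x, r=2):
    """Inverse of channel_multiplex: (C*r*r, H//r, W//r) -> (C, H, W)."""
    cr2, hh, ww = x.shape
    c = cr2 // (r * r)
    x = x.reshape(c, r, r, hh, ww)
    x = x.transpose(0, 3, 1, 4, 2)           # (C, H//r, r, W//r, r)
    return x.reshape(c, hh * r, ww * r)


# Toy "spectrogram" with 2 channels, 4 frequency bins, 6 frames.
spec = np.arange(2 * 4 * 6, dtype=np.float32).reshape(2, 4, 6)
packed = channel_multiplex(spec)
restored = channel_demultiplex(packed)
```

Because the rearrangement is a pure permutation of entries, applying the inverse recovers the input exactly, which is what "without information loss" means in this sketch.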

Abstract:
This work pioneers the utilization of generative features in enhancing audio understanding. Unlike conventional discriminative features that directly optimize the posterior and thus emphasize semantic abstraction while losing fine‑grained details, audio generation models inherently encode both spatiotemporal perception (capturing local acoustic texture across time and frequency) and semantic prior (knowing what to generate). This motivates us to bridge these complementary strengths. We provide a systematic investigation of their differences and complementary relationships, and ultimately propose an effective fusion strategy. Experiments across multiple tasks, including sound event classification, tagging, and particularly the fine‑grained task of audio captioning, demonstrate consistent performance gains. Beyond empirical improvements, this work more importantly introduces a new perspective on audio representation learning, highlighting that generative‑discriminative complementarity can provide both detailed perception and semantic awareness for audio understanding.

Abstract:
Video‑to‑audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Despite their excellent results, existing approaches either require costly joint training on large‑scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training‑free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug‑and‑play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, demonstrating the effectiveness of joint multimodal guidance for V2A.

Abstract:
The design of diffusion‑based audio generation systems has been investigated from diverse perspectives, such as data space, network architecture, and conditioning techniques, while most of these innovations require model re‑training. In sampling, classifier‑free guidance (CFG) has been uniformly adopted to enhance generation quality by strengthening condition alignment. However, CFG often compromises diversity, resulting in suboptimal performance. Although the recent autoguidance (AG) method proposes another direction of guidance that maintains diversity, its direct application in audio generation has so far underperformed CFG. In this work, we introduce AudioMoG, an improved sampling method that enhances text‑to‑audio (T2A) and video‑to‑audio (V2A) generation quality without requiring extensive training resources. We start with an analysis of both CFG and AG, examining their respective advantages and limitations for guiding diffusion models. Building upon our insights, we introduce a mixture‑of‑guidance framework that integrates diverse guidance signals with their interaction terms (e.g., the unconditional bad version of the model) to maximize cumulative advantages. Experiments show that, given the same inference speed, our approach consistently outperforms single guidance in T2A generation across sampling steps, concurrently showing advantages in V2A, text‑to‑music, and image generation. Demo samples are available at: https://audiomog.github.io.

Abstract:
Diffusion models have achieved remarkable progress in image and audio generation, largely due to Classifier‑Free Guidance. However, the choice of guidance scale remains underexplored: a fixed scale often fails to generalize across prompts of varying complexity, leading to oversaturation or weak alignment. We address this gap by introducing a prompt‑aware framework that predicts scale‑dependent quality and selects the optimal guidance at inference. Specifically, we construct a large synthetic dataset by generating samples under multiple scales and scoring them with reliable evaluation metrics. A lightweight predictor, conditioned on semantic embeddings and linguistic complexity, estimates multi‑metric quality curves and determines the best scale via a utility function with regularization. Experiments on MSCOCO 2014 and AudioCaps show consistent improvements over vanilla CFG, enhancing fidelity, alignment, and perceptual preference. This work demonstrates that prompt‑aware scale selection provides an effective, training‑free enhancement for pretrained diffusion backbones.
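
For context, the guidance scale enters sampling through the standard classifier‑free guidance combination of unconditional and conditional predictions. The sketch below shows that combination, followed by a toy scale selector. The selector `pick_scale`, its quadratic penalty, and the `default` scale are hypothetical stand‑ins for the learned predictor and utility function mentioned in the abstract, not the authors' method.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def pick_scale(scales, predicted_quality, reg=0.0, default=3.0):
    """Hypothetical selector: maximize predicted quality minus a penalty
    for straying from a default scale (the penalty form is an assumption)."""
    scales = np.asarray(scales, dtype=float)
    quality = np.asarray(predicted_quality, dtype=float)
    utility = quality - reg * (scales - default) ** 2
    return float(scales[int(np.argmax(utility))])


eps_u = np.array([0.0, 1.0])   # toy unconditional prediction
eps_c = np.array([1.0, 3.0])   # toy conditional prediction
guided = cfg_combine(eps_u, eps_c, scale=2.0)

# Toy per-scale quality estimates for one prompt.
best = pick_scale([1.0, 3.0, 7.0], [0.2, 0.6, 0.65], reg=0.01)
```

With `scale=1.0` the combination reduces to the conditional prediction; larger scales push further along the same direction, which is why a single fixed value can over‑ or under‑shoot depending on the prompt.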

Abstract:
Human auditory perception is shaped by moving sound sources in 3D space, yet prior work in generative sound modelling has largely been restricted to mono signals or static spatial audio. In this work, we introduce a framework for generating moving sounds given text prompts in a controllable fashion. To enable training, we construct a synthetic dataset that records moving sounds in binaural format, their spatial trajectories, and text captions about the sound event and spatial motion. Using this dataset, we train a text‑to‑trajectory prediction model that outputs the three‑dimensional trajectory of a moving sound source given text prompts. To generate spatial audio, we first fine‑tune a pre‑trained text‑to‑audio generative model to output temporally aligned mono sound with the trajectory. The spatial audio is then simulated using the predicted temporally‑aligned trajectory. Experimental evaluation demonstrates reasonable spatial understanding of the text‑to‑trajectory model. This approach could be easily integrated into existing text‑to‑audio generative workflow and extended to moving sound generation in other spatial audio formats.

Abstract:
We present AIBA (Attention‑In‑Band Alignment), a lightweight, training‑free pipeline to quantify where text‑to‑audio diffusion models attend on the time‑frequency (T‑F) plane. AIBA (i) hooks cross‑attention at inference to record attention probabilities without modifying weights; (ii) projects them to fixed‑size mel grids that are directly comparable to audio energy; and (iii) scores agreement with instrument‑band ground truth via interpretable metrics (T‑F IoU/AP, frequency‑profile correlation, and a pointing game). On Slakh2100 with an AudioLDM2 backbone, AIBA reveals consistent instrument‑dependent trends (e.g., bass favoring low bands) and achieves high precision with moderate recall.

Abstract:
Data synthesis and augmentation are essential for Sound Event Detection (SED) due to the scarcity of temporally labeled data. While augmentation methods like SpecAugment and Mix‑up can enhance model performance, they remain constrained by the diversity of existing samples. Recent generative models offer new opportunities, yet their direct application to SED is challenging due to the lack of precise temporal annotations and the risk of introducing noise through unreliable filtering. To address these challenges and enable generative‑based augmentation for SED, we propose SynSonic, a data augmentation method tailored for this task. SynSonic leverages text‑to‑audio diffusion models guided by an energy‑envelope ControlNet to generate temporally coherent sound events. A joint score filtering strategy with dual classifiers ensures sample quality, and we explore its practical integration into training pipelines. Experimental results show that SynSonic improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization and sound class discrimination.

Abstract:
We present StereoFoley, a video‑to‑audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video‑to‑audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object‑aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video‑to‑audio datasets. First, we develop a base model that generates stereo audio from video, achieving performance on par with state‑of‑the‑art V2A models in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance‑based loudness controls, enabling spatially accurate object‑aware sound. Finally, we fine‑tune the base model on this synthetic dataset, yielding clear object‑audio correspondence. Since no established metrics exist, we introduce a stereo object‑awareness metric and report it alongside a human listening study; the two evaluations exhibit consistent trends. This work establishes the first end‑to‑end framework for stereo object‑aware video‑to‑audio generation, addressing a critical gap in the field.
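
As a rough illustration of "dynamic panning and distance‑based loudness controls", the sketch below applies a constant‑power pan law and an inverse‑distance gain to a mono signal. The pan law, the clamped inverse‑distance gain, and the function name `pan_mono_to_stereo` are assumptions chosen for this example; the abstract does not specify which laws the authors' pipeline uses.

```python
import numpy as np


def pan_mono_to_stereo(mono, pan, distance, ref_distance=1.0):
    """Render mono audio to stereo with time-varying pan and distance.

    pan: values in [-1, 1] (-1 = hard left, +1 = hard right), per sample.
    distance: meters, scalar or per sample. Gain follows an inverse-distance
    law clamped at ref_distance (an assumption for this sketch).
    """
    mono = np.asarray(mono, dtype=float)
    theta = (np.asarray(pan, dtype=float) + 1.0) * np.pi / 4.0  # 0 .. pi/2
    dist = np.maximum(np.asarray(distance, dtype=float), ref_distance)
    gain = ref_distance / dist
    left = mono * np.cos(theta) * gain
    right = mono * np.sin(theta) * gain
    return np.stack([left, right])


# A constant signal moving from hard left to hard right at 2 m.
mono = np.ones(4)
pan = np.linspace(-1.0, 1.0, 4)
stereo = pan_mono_to_stereo(mono, pan, distance=2.0)
```

The constant‑power law keeps the summed energy of the two channels independent of pan position, so only the distance term changes perceived loudness here.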

Abstract:
This paper introduces MR‑CQTdiff, a novel neural‑network architecture for diffusion‑based audio generation that leverages a multi‑resolution Constant‑Q Transform (CQT). The proposed architecture employs an efficient, invertible CQT framework that adjusts the time‑frequency resolution on an octave‑by‑octave basis. This design addresses the issue of low temporal resolution at lower frequencies, enabling more flexible and expressive audio generation. We conduct an evaluation using the Fréchet Audio Distance (FAD) metric across various architectures and two datasets. Experimental results demonstrate that MR‑CQTdiff achieves state‑of‑the‑art audio quality, outperforming competing architectures.

Abstract:
A persistent challenge in generative audio models is data replication, where the model unintentionally generates parts of its training data during inference. In this work, we address this issue in text‑to‑audio diffusion models by exploring the use of anti‑memorization strategies. We adopt Anti‑Memorization Guidance (AMG), a technique that modifies the sampling process of pre‑trained diffusion models to discourage memorization. Our study explores three types of guidance within AMG, each designed to reduce replication while preserving generation quality. We use Stable Audio Open as our backbone, leveraging its fully open‑source architecture and training dataset. Our comprehensive experimental analysis suggests that AMG significantly mitigates memorization in diffusion‑based text‑to‑audio generation without compromising audio fidelity or semantic alignment.

Abstract:
Diffusion models have shown remarkable progress in text‑to‑audio generation. However, text‑guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest, thus demanding precise localization and faithful editing according to the text prompt. Existing training‑based and zero‑shot methods that rely on full‑caption or costly optimization often struggle with complex editing or lack practicality. In this work, we propose a novel end‑to‑end efficient rectified flow matching‑based diffusion framework for audio editing, and construct a dataset featuring overlapping multi‑event audio to support training and benchmarking in complex scenarios. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks, while maintaining competitive editing quality across metrics.

Abstract:
While many text‑to‑audio systems produce monophonic or fixed‑stereo outputs, generating audio with user‑defined spatial properties remains a challenge. Existing deep learning‑based spatialization methods often rely on latent‑space manipulations, which can limit direct control over psychoacoustic parameters critical to spatial perception. To address this, we introduce STASE, a system that leverages a Large Language Model (LLM) as an agent to interpret spatial cues from text. A key feature of STASE is the decoupling of semantic interpretation from a separate, physics‑based spatial rendering engine, which facilitates interpretable and user‑controllable spatial reasoning. The LLM processes prompts through two main pathways: (i) Description Prompts, for direct mapping of explicit spatial information (e.g., "place the lead guitar at 45° azimuth, 10 m distance"), and (ii) Abstract Prompts, where a Retrieval‑Augmented Generation (RAG) module retrieves relevant spatial templates to inform the rendering. This paper details the STASE workflow, discusses implementation considerations, and highlights current challenges in evaluating generative spatial audio.

Abstract:
Neural Audio Codecs (NACs) have become increasingly adopted in speech processing tasks due to their excellent rate‑distortion performance and compatibility with Large Language Models (LLMs) as discrete feature representations for audio generation. While most existing codecs rely on Residual Vector Quantization (RVQ), Finite Scalar Quantization (FSQ) has recently emerged as a compelling alternative that simplifies training and natively supports single codebooks. We introduce NeuCodec, an FSQ‑based NAC, and show that FSQ has baked‑in redundancy, which yields an encoding that is robust when transmitted through noisy channels. First, through an encoder distillation experiment, we show that two different encoders can learn to encode identical audio into vastly different code sequences whilst maintaining comparable reconstruction quality with the same quantizer and decoder. Second, we demonstrate that FSQ has vastly superior bit‑level perturbation robustness by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.

Abstract:
Audio Language Models (ALM) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade‑off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than its discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state‑of‑the‑art discrete audio language models, facilitating lightweight, high‑quality audio generation. Samples are available at hf.co/spaces/kyutai/calm‑samples. Finally, we release Pocket TTS, an open‑source 100M‑parameter text‑to‑speech model that can run faster than real time on a laptop CPU: github.com/kyutai‑labs/pocket‑tts.

Abstract:
A key challenge in synthesizing audio from silent videos is the inherent trade‑off between synthesis quality and inference efficiency in existing methods. For instance, flow‑matching‑based models rely on modeling instantaneous velocity and therefore inherently require an iterative sampling process, leading to slow inference. To address this efficiency bottleneck, we introduce a MeanFlow‑accelerated model that characterizes flow fields using average velocity, enabling one‑step generation and thereby significantly accelerating multimodal video‑to‑audio (VTA) synthesis while preserving audio quality, semantic alignment, and temporal synchronization. Furthermore, a scalar rescaling mechanism is employed to balance conditional and unconditional predictions when classifier‑free guidance (CFG) is applied, effectively mitigating CFG‑induced distortions in one‑step generation. Since the audio synthesis network is jointly trained with multimodal conditions, we further evaluate it on the text‑to‑audio (TTA) synthesis task. Experimental results demonstrate that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality on both VTA and TTA synthesis tasks.

Abstract:
We present LatinX, a multilingual text‑to‑speech (TTS) model for cascaded speech‑to‑speech translation that preserves the source speaker's identity across languages. LatinX is a 12‑layer decoder‑only Transformer trained in three stages: (i) pre‑training for text‑to‑audio mapping, (ii) supervised fine‑tuning for zero‑shot voice cloning, and (iii) alignment with Direct Preference Optimization (DPO) using automatically labeled pairs based on Word Error Rate (WER) and speaker‑similarity metrics. Trained on English and Romance languages with emphasis on Portuguese, LatinX with DPO consistently reduces WER and improves objective similarity over the fine‑tuned baseline. Human evaluations further indicate stronger perceived speaker similarity than a strong baseline (XTTSv2), revealing gaps between objective and subjective measures. We provide cross‑lingual analyses and discuss balanced preference signals and lower‑latency architectures as future work.
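
To make the automatic pair labeling concrete, the sketch below computes word error rate with a word‑level edit distance and then orders two candidates into a (chosen, rejected) pair. The labeling rule in `label_pair` (lower WER first, ties broken by higher speaker similarity) and the candidate dict layout are assumptions for illustration; the abstract only states that pairs are labeled using WER and speaker‑similarity metrics.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def label_pair(a, b):
    """Hypothetical rule: prefer lower WER, break ties by higher similarity.
    Each candidate is a dict with "wer" and "sim" keys (assumed layout)."""
    chosen, rejected = sorted([a, b], key=lambda c: (c["wer"], -c["sim"]))
    return chosen, rejected


score = wer("o gato dorme no sofá", "o gato dorme sofá")
cand_a = {"name": "a", "wer": 0.1, "sim": 0.7}
cand_b = {"name": "b", "wer": 0.1, "sim": 0.9}
chosen, rejected = label_pair(cand_a, cand_b)
```

Here the hypothesis drops one of five reference words, and the two candidates tie on WER so the one with higher speaker similarity is preferred.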

Abstract:
Recent Video‑to‑Audio (V2A) generation relies on extracting semantic and temporal features from video to condition generative models. Training these models from scratch is resource intensive. Consequently, leveraging foundation models (FMs) has gained traction due to their cross‑modal knowledge transfer and generalization capabilities. One prior work has explored fine‑tuning a lightweight mapper network to connect a pre‑trained visual encoder with a text‑to‑audio generation model for V2A. Inspired by this, we introduce the Multiple Foundation Model Mapper (MFM‑Mapper). Compared to the previous mapper approach, MFM‑Mapper benefits from richer semantic and temporal information by fusing features from dual visual encoders. Furthermore, by replacing a linear mapper with GPT‑2, MFM‑Mapper improves feature alignment, drawing parallels between cross‑modal feature mapping and autoregressive translation tasks. Our MFM‑Mapper exhibits remarkable training efficiency: it achieves better semantic and temporal consistency at lower training cost, requiring only 16% of the training scale of previous mapper‑based work, yet remains competitive with models trained on a much larger scale.

Abstract:
We present a system for automatic multi‑axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores‑‑Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness‑‑for audio generated by text‑to‑speech (TTS), text‑to‑audio (TTA), and text‑to‑music (TTM) systems. A main challenge is the domain shift between natural training data and synthetic evaluation data. To address this, we combine BEATs, a pretrained transformer‑based audio representation model, with a multi‑branch long short‑term memory (LSTM) predictor and use a triplet loss with buffer‑based sampling to structure the embedding space by perceptual similarity. Our results show that this improves embedding discriminability and generalization, enabling domain‑robust audio quality assessment without synthetic training data.
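
The triplet objective mentioned above can be illustrated with a small NumPy sketch. The abstract does not specify the sampling rule, so `sample_triplet` below is a hypothetical stand-in for buffer-based sampling: it picks the buffer item with the closest quality score as the positive and the one with the most distant score as the negative. The margin value is likewise an assumption.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on Euclidean distances, averaged over the batch."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def sample_triplet(buffer_scores, anchor_idx):
    """Hypothetical sampling rule (an assumption, not the paper's):
    positive = closest quality score, negative = most distant score."""
    diffs = np.abs(buffer_scores - buffer_scores[anchor_idx])
    masked = diffs.copy()
    masked[anchor_idx] = np.inf          # the anchor cannot be its own positive
    return int(np.argmin(masked)), int(np.argmax(diffs))


# Toy buffer of four quality scores and 2-D embeddings (illustrative values).
scores = np.array([1.0, 1.2, 4.0, 2.5])
embeddings = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0], [0.0, 0.5]])
pos, neg = sample_triplet(scores, anchor_idx=0)
loss = triplet_loss(embeddings[[0]], embeddings[[pos]], embeddings[[neg]])
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is what pushes perceptually dissimilar clips apart in the embedding space.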

Abstract:
Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, together with standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank systems, helping researchers and developers improve their reliability and robustness. We include 14 evaluation sets, 12 state‑of‑the‑art open‑source and 3 proprietary detection systems. Our study shows that many systems exhibit a high equal error rate (EER) in out‑of‑domain scenarios, highlighting the need for extensive cross‑domain evaluation. The leaderboard is hosted on Hugging Face, and a toolkit for reproducing results across the listed datasets is available on GitHub.
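
For readers unfamiliar with the metric, EER (equal error rate) is the operating point where the false acceptance rate equals the false rejection rate. The sketch below is a minimal threshold sweep, not the benchmark's toolkit; the score and label conventions (higher score means more likely bona fide, label 1 for bona fide and 0 for spoof) are assumptions.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumed conventions: higher score = more likely bona fide;
    label 1 = bona fide, label 0 = spoof.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        far = np.mean(accepted[labels == 0])    # spoofed clips accepted
        frr = np.mean(~accepted[labels == 1])   # bona fide clips rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)


# Toy examples (illustrative values only).
separable = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A detector that separates the two classes perfectly reaches an EER of 0, while overlapping score distributions raise it, which is the pattern the abstract reports for out-of-domain data.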

Abstract:
This is the summary paper for the AudioMOS Challenge 2025, the very first challenge for automatic subjective quality prediction for synthetic audio. The challenge consists of three tracks. The first track aims to assess text‑to‑music samples in terms of overall quality and textual alignment. The second track is based on the four evaluation dimensions of Meta Audiobox Aesthetics, and the test set consists of text‑to‑speech, text‑to‑audio, and text‑to‑music samples. The third track focuses on synthetic speech quality assessment in different sampling rates. The challenge attracted 24 unique teams from both academia and industry, and improvements over the baselines were confirmed. The outcome of this challenge is expected to facilitate development and progress in the field of automatic evaluation for audio generation systems.

Abstract:
Rapid advancements in generative modeling have made synthetic audio generation easy, making speech‑based services vulnerable to spoofing attacks. Consequently, robust countermeasures are needed more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability, and they often fail drastically when applied to real‑world data. This study proposes a novel method for generalizable spoofing detection leveraging non‑semantic universal audio representations. Extensive experiments have been performed to find suitable non‑semantic features using TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in‑domain test set while significantly outperforming state‑of‑the‑art approaches on out‑of‑domain test sets. Notably, it demonstrates superior generalization on public‑domain data, surpassing methods based on hand‑crafted features, semantic embeddings, and end‑to‑end architectures.

Abstract:
A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable‑length audio features into a concise fixed‑size representation that effectively encodes speech quality. Existing pooling methods typically operate at a singular granularity, concentrating either on a comprehensive global perspective or a detailed frame‑level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual‑Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse‑grained, global statistical summaries and fine‑grained, attentive analyses of perceptually significant segments. This dual‑view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES‑Natural), MOS prediction backbones (including a CLAP‑based model and AudioBox‑Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system‑level Spearman's rank correlation coefficient (SRCC) over the widely‑used average pooling approach.
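
To make the idea of combining a global view with an attentive view concrete, here is a rough NumPy sketch. It is not the DRASP architecture: the attentive branch scores individual frames with a single vector `w` (a hypothetical stand-in for learned attention parameters), and the function names are illustrative.

```python
import numpy as np


def global_stats(frames):
    """Coarse view: unweighted mean and standard deviation over time."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def attentive_stats(frames, w):
    """Fine view: attention-weighted mean and standard deviation.

    `w` is a hypothetical scoring vector; in a real model it would be learned.
    """
    logits = frames @ w
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                       # softmax over frames
    mean = (weights[:, None] * frames).sum(axis=0)
    var = (weights[:, None] * (frames - mean) ** 2).sum(axis=0)
    return np.concatenate([mean, np.sqrt(var + 1e-8)])


def dual_view_pool(frames, w):
    """Concatenate both views into one fixed-size vector of length 4 * D."""
    return np.concatenate([global_stats(frames), attentive_stats(frames, w)])


rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))                # 50 frames, 8 dimensions
pooled = dual_view_pool(features, w=np.zeros(8))   # zero scores = uniform attention
```

With uniform attention the two branches coincide, so any benefit of the attentive branch comes from the learned weights concentrating on salient frames.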

Abstract:
Controllable text‑to‑audio generation aims to synthesize audio from textual descriptions while satisfying user‑specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade‑offs among accurate temporal localization, open‑vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph‑guided diffusion transformer framework for open‑vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter‑event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high‑quality and diverse training data, we introduce a quality‑balanced data selection pipeline that combines hierarchical event annotation with multi‑criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, facilitating audio generation through consensus among multiple reward signals. Extensive experiments on AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state‑of‑the‑art performances across a variety of objective and subjective evaluation metrics.

Abstract:
Recently, with the advancement of AIGC, deep learning‑based video‑to‑audio (V2A) technology has garnered significant attention. However, existing research mostly focuses on mono audio generation that lacks spatial perception, while the exploration of binaural spatial audio generation technologies, which can provide a stronger sense of immersion, remains insufficient. To solve this problem, we propose FoleySpace, a framework for video‑to‑binaural audio generation that produces immersive and spatially consistent stereo sound guided by visual information. Specifically, we develop a sound source estimation method to determine the sound source 2D coordinates and depth in each video frame, and then employ a coordinate mapping mechanism to convert the 2D source positions into a 3D trajectory. This 3D trajectory, together with the monaural audio generated by a pre‑trained V2A model, serves as a conditioning input for a diffusion model to generate spatially consistent binaural audio. To support the generation of dynamic sound fields, we constructed a training dataset based on recorded Head‑Related Impulse Responses that includes various sound source movement scenarios. Experimental results demonstrate that the proposed method outperforms existing approaches in spatial perception consistency, effectively enhancing the immersive quality of the audio‑visual experience.

Abstract:
Real‑world multimodal applications often require any‑to‑any capabilities, enabling both understanding and generation across modalities including text, image, audio, and video. However, integrating the strengths of autoregressive language models (LLMs) for reasoning and diffusion models for high‑fidelity generation remains challenging. Existing approaches rely on rigid pipelines or tightly coupled architectures, limiting flexibility and scalability. We propose MAGUS (Multi‑Agent Guided Unified Multimodal System), a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. MAGUS enables symbolic multi‑agent collaboration within a shared textual workspace. In the Cognition phase, three role‑conditioned multimodal LLM agents ‑ Perceiver, Planner, and Reflector ‑ engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth‑Aware Search mechanism that orchestrates LLM‑based reasoning and diffusion‑based generation in a mutually reinforcing manner. MAGUS supports plug‑and‑play extensibility, scalable any‑to‑any modality conversion, and semantic alignment ‑ all without the need for joint training. Experiments across multiple benchmarks, including image, video, and audio generation, as well as cross‑modal instruction following, demonstrate that MAGUS outperforms strong baselines and state‑of‑the‑art systems. Notably, on the MME benchmark, MAGUS surpasses the powerful closed‑source model GPT‑4o.

Abstract:
Evaluating audio generation systems, including text‑to‑music (TTM), text‑to‑speech (TTS), and text‑to‑audio (TTA), remains challenging due to the subjective and multi‑dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality‑aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre‑trained audio‑text models such as CLAP and Audiobox‑Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
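
One way to picture a ranking objective with an adaptive margin is a pairwise hinge loss whose margin grows with the gap between the two ground-truth ratings. The sketch below is an assumption about the general idea, not the QAMRO formulation; the function name and the `scale` parameter are hypothetical.

```python
import numpy as np


def adaptive_margin_ranking_loss(pred, target, scale=1.0):
    """Pairwise hinge loss with a margin proportional to the true rating gap.

    Illustrative assumption: for every pair where item i is rated above
    item j, the predicted gap should be at least `scale` times the true gap.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pred_gap = pred[:, None] - pred[None, :]
    true_gap = target[:, None] - target[None, :]
    ordered = true_gap > 0                 # pairs where i should outrank j
    if not ordered.any():
        return 0.0
    violations = np.maximum(0.0, scale * true_gap - pred_gap)
    return float(violations[ordered].mean())


# Toy ratings (illustrative values only).
ratings = [1.0, 2.0, 3.0]
perfect = adaptive_margin_ranking_loss(ratings, ratings)
reversed_order = adaptive_margin_ranking_loss([3.0, 2.0, 1.0], ratings)
```

Unlike a plain regression loss, this penalizes a wrong ordering more heavily when the two clips are far apart in perceived quality.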

Abstract:
Despite recent advances, long‑sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi‑agent collaborative framework designed to assist in long‑sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle ‑‑ Explore, Examine, and Enhance ‑‑ to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state‑of‑the‑art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high‑quality, complete long‑sequence videos. To the best of our knowledge, MAViS is the only framework that provides multimodal design output ‑‑ videos with narratives and background music.

Abstract:
The field of visual and audio generation is burgeoning with new state‑of‑the‑art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine‑grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.

Abstract:
Recent years have witnessed remarkable progress in Text‑to‑Audio Generation (TTA), providing sound creators with powerful tools to transform inspirations into vivid audio. Yet despite these advances, current TTA systems often suffer from slow inference speed, which greatly hinders the efficiency and smoothness of audio creation. In this paper, we present MeanAudio, a fast and faithful text‑to‑audio generator capable of rendering realistic sound with only one function evaluation (1‑NFE). MeanAudio leverages: (i) the MeanFlow objective with guided velocity target that significantly accelerates inference speed, (ii) an enhanced Flux‑style transformer with dual text encoders for better semantic alignment and synthesis quality, and (iii) an efficient instantaneous‑to‑mean curriculum that speeds up convergence and enables training on consumer‑grade GPUs. Through a comprehensive evaluation study, we demonstrate that MeanAudio achieves state‑of‑the‑art performance in single‑step audio generation. Specifically, it achieves a real‑time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion‑based TTA systems. Moreover, MeanAudio also shows strong performance in multi‑step generation, enabling smooth transitions across successive synthesis steps.
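
The real-time factor quoted above is the ratio of compute time to the duration of the generated audio, so values below 1 mean faster than real time. The short sketch below works through that arithmetic; the 10-second clip length is an illustrative assumption, and the helper names are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: seconds of compute needed for a clip."""
    return rtf * audio_seconds


# 0.013 is the RTF reported in the abstract; the 10 s clip length is an
# illustrative assumption, not a figure from the paper.
seconds_needed = processing_time(0.013, 10.0)
print(f"{seconds_needed:.2f} s of compute for 10 s of audio")
```

At this RTF a 10-second clip needs roughly 0.13 seconds of compute on the stated GPU.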

Abstract:
Singing Voice Conversion (SVC) transfers a source singer's timbre to a target while keeping melody and lyrics. The key challenge in any‑to‑any SVC is adapting unseen speaker timbres to source audio without quality degradation. Existing methods either face timbre leakage or fail to achieve satisfactory timbre similarity and quality in the generated audio. To address these challenges, we propose DAFMSVC, where the self‑supervised learning (SSL) features from the source audio are replaced with the most similar SSL features from the target audio to prevent timbre leakage. It also incorporates a dual cross‑attention mechanism for the adaptive fusion of speaker embeddings, melody, and linguistic content. Additionally, we introduce a flow matching module for high quality audio generation from the fused features. Experimental results show that DAFMSVC significantly enhances timbre similarity and naturalness, outperforming state‑of‑the‑art methods in both subjective and objective evaluations.
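
The feature-replacement step described above can be sketched as a nearest-neighbor lookup. The abstract only says "most similar", so the use of cosine similarity and a single best match per frame are assumptions, and random vectors would stand in for real SSL features.

```python
import numpy as np


def replace_with_nearest(source_feats, target_feats):
    """Swap each source frame for its most similar target frame.

    Cosine similarity and top-1 matching are assumptions; the abstract
    only states that the most similar target features are used.
    """
    src_norm = np.maximum(np.linalg.norm(source_feats, axis=1, keepdims=True), 1e-12)
    tgt_norm = np.maximum(np.linalg.norm(target_feats, axis=1, keepdims=True), 1e-12)
    similarity = (source_feats / src_norm) @ (target_feats / tgt_norm).T
    nearest = similarity.argmax(axis=1)          # best target frame per source frame
    return target_feats[nearest], nearest


# Toy 2-D "features" (illustrative values only).
source = np.array([[1.0, 0.0], [0.0, 2.0]])
target = np.array([[0.0, 1.0], [3.0, 0.1]])
converted, indices = replace_with_nearest(source, target)
```

Because every output frame is drawn from the target speaker's own features, no source-speaker frame survives in the result, which is the intuition behind preventing timbre leakage.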

Abstract:
Recent advances in audio generation systems have enabled the creation of highly realistic and immersive soundscapes, which are increasingly used in film and virtual reality. However, these audio generators also raise concerns about potential misuse, such as generating deceptive audio content for fake videos and spreading misleading information. Existing datasets for environmental sound deepfake detection (ESDD) are limited in scale and audio types. To address this gap, we have proposed EnvSDD, the first large‑scale curated dataset designed for ESDD, consisting of 45.25 hours of real and 316.7 hours of fake sound. Based on EnvSDD, we are launching the Environmental Sound Deepfake Detection Challenge. Specifically, we present two different tracks: ESDD in Unseen Generators and Black‑Box Low‑Resource ESDD, covering various challenges encountered in real‑life scenarios. The challenge will be held in conjunction with the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026).

Abstract:
We present AudioGen‑Omni ‑ a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high‑fidelity audio, speech, and song coherently synchronized with the input video. AudioGen‑Omni introduces a novel joint training paradigm that seamlessly integrates large‑scale video‑text‑audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen‑Omni employs a unified lyrics‑transcription encoder that encodes graphemes and phonemes from both song and spoken inputs into dense frame‑level representations. Dense frame‑level representations are fused using an AdaLN‑based joint attention mechanism enhanced with phase‑aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross‑modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen‑Omni mitigates the semantic constraints of text‑frozen paradigms, enabling effective cross‑modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip‑sync accuracy, while also achieving state‑of‑the‑art results on Text‑to‑Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.

Abstract:
Most existing text‑to‑audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text‑to‑multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
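
The final arrangement step can be sketched as placing each rendered event on a shared timeline at its start time and summing overlaps. The event tuple layout, the sample rate, and the function name below are assumptions made for illustration.

```python
import numpy as np


def arrange_events(events, sample_rate, total_seconds):
    """Sum binaural clips into one timeline at their start times.

    `events` is a list of (start_seconds, clip) pairs, where each clip has
    shape (num_samples, 2). This layout is an illustrative assumption.
    """
    timeline = np.zeros((int(round(total_seconds * sample_rate)), 2))
    for start_seconds, clip in events:
        begin = int(round(start_seconds * sample_rate))
        end = min(begin + len(clip), len(timeline))
        if end > begin:                          # skip clips starting past the end
            timeline[begin:end] += clip[: end - begin]
    return timeline


# Two overlapping toy events at a deliberately tiny sample rate.
rate = 10
first = np.ones((5, 2))
second = 2.0 * np.ones((5, 2))
mix = arrange_events([(0.0, first), (0.3, second)], rate, total_seconds=2.0)
```

Overlapping events simply add, and clips that run past the end of the timeline are truncated.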

Abstract:
Text‑to‑speech (TTS) synthesis has seen renewed progress under the discrete modeling paradigm. Existing autoregressive approaches often rely on single‑codebook representations, which suffer from significant information loss. Even with post‑hoc refinement techniques such as flow matching, these methods fail to recover fine‑grained details (e.g., prosodic nuances, speaker‑specific timbres), especially in challenging scenarios like singing voice or music synthesis. We propose QTTS, a novel TTS framework built upon our new audio codec, QDAC. The core innovation of QDAC lies in its end‑to‑end training of an ASR‑based auto‑regressive network with a GAN, which achieves superior semantic feature disentanglement for scalable, near‑lossless compression. QTTS models these discrete codes using two innovative strategies: the Hierarchical Parallel architecture, which uses a dual‑AR structure to model inter‑codebook dependencies for higher‑quality synthesis, and the Delay Multihead approach, which employs parallelized prediction with a fixed delay to accelerate inference speed. Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baseline. This suggests that scaling up compression via multi‑codebook modeling is a promising direction for high‑fidelity, general‑purpose speech and audio generation.

Abstract:
Autoregressive next‑token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We study audio generation with a causal language model (LM) without discrete tokens. We leverage token‑wise diffusion to model the continuous distribution of the next continuous‑valued token. Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback‑Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next‑token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state‑of‑the‑art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters ‑‑ 193M for our Base and 462M for our Large models.

Abstract:
Spatial audio has become central to immersive applications such as VR/AR, cinema, and music. Existing generative audio models are largely limited to mono or stereo formats and cannot capture the full 3D localization cues available in first‑order Ambisonics (FOA). Recent FOA models extend text‑to‑audio generation but remain restricted to static sources. In this work, we introduce SonicMotion, the first end‑to‑end latent diffusion framework capable of generating FOA audio with explicit control over moving sound sources. SonicMotion is implemented in two variations: 1) a descriptive model conditioned on natural language prompts, and 2) a parametric model conditioned on both text and spatial trajectory parameters for higher precision. To support training and evaluation, we construct a new dataset of over one million simulated FOA caption pairs that include both static and dynamic sources with annotated azimuth, elevation, and motion attributes. Experiments show that SonicMotion achieves state‑of‑the‑art semantic alignment and perceptual quality comparable to leading text‑to‑audio systems, while uniquely attaining low spatial localization error.
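
As background for the FOA format and the annotated azimuth and elevation, the sketch below encodes a mono signal into four first-order channels with a time-varying direction. It is a classical panning formula, not the SonicMotion model; the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization, azimuth measured counterclockwise from the front) is an assumption, since the abstract does not name one.

```python
import numpy as np


def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal into first-order Ambisonics.

    Assumed convention: AmbiX (ACN order W, Y, Z, X; SN3D), angles in
    radians, given as scalars or as per-sample arrays for moving sources.
    """
    signal = np.asarray(signal, dtype=float)
    az = np.broadcast_to(np.asarray(azimuth, dtype=float), signal.shape)
    el = np.broadcast_to(np.asarray(elevation, dtype=float), signal.shape)
    w = signal                                   # omnidirectional component
    y = signal * np.sin(az) * np.cos(el)         # left-right
    z = signal * np.sin(el)                      # up-down
    x = signal * np.cos(az) * np.cos(el)         # front-back
    return np.stack([w, y, z, x])


# A source sweeping from the front to the left at ear level (toy values).
mono = np.ones(4)
sweep = np.linspace(0.0, np.pi / 2, 4)
foa = encode_foa(mono, sweep, 0.0)
```

Passing per-sample angle arrays is what turns a static source into a moving one: the energy shifts from the X channel to the Y channel as the azimuth sweeps.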

Abstract:
This paper presents a physics‑informed neural network (PINN) for modeling first‑order Ambisonic (FOA) room impulse responses (RIRs). PINNs have demonstrated promising performance in sound field interpolation by combining the powerful modeling capability of neural networks and the physical principles of sound propagation. In room acoustics, PINNs have typically been trained to represent the sound pressure measured by omnidirectional microphones where the wave equation or its frequency‑domain counterpart, i.e., the Helmholtz equation, is leveraged. Meanwhile, FOA RIRs additionally provide spatial characteristics and are useful for immersive audio generation with a wide range of applications. In this paper, we extend the PINN framework to model FOA RIRs. We derive two physics‑informed priors for FOA RIRs based on the correspondence between the particle velocity and the (X, Y, Z)‑channels of FOA. These priors associate the predicted W‑channel and other channels through their partial derivatives and impose the physically feasible relationship on the four channels. Our experiments confirm the effectiveness of the proposed method compared with a neural network without the physics‑informed prior.

Abstract:
Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non‑intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no‑reference, multi‑domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame‑level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20‑30% reduction in mean squared error and a 4‑5% increase in Kendall's τ versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.

Abstract:
Video‑to‑Audio (V2A) generation has achieved significant progress and plays a crucial role in film and video post‑production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self‑distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio‑visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large‑scale V2A dataset, VGGSound.

Abstract:
We propose a novel objective evaluation metric for synthesized audio in text‑to‑audio (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of the synthesized sound is important, but carrying it out is costly. Therefore, objective metrics such as mel‑cepstral distortion are used, but the correlation between these objective metrics and subjective evaluation values is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between embeddings of the synthesized and reference sounds. The method is based not only on the max‑norm used in conventional BERTScore but also on the p‑norm to reflect the non‑local nature of environmental sounds. Experimental results show that scores obtained by the proposed method have a higher correlation with subjective evaluation values than conventional metrics.
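
The contrast between max-based and p-norm-based matching can be sketched on a frame-by-frame similarity matrix. This is only an illustration of the general idea: the power-mean form of the p-norm, the clipping of negative similarities, and the recall-style direction are all assumptions, and the paper's exact definition may differ.

```python
import numpy as np


def cosine_matrix(ref, cand):
    """Cosine similarity between every reference and candidate frame."""
    ref_unit = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    cand_unit = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    return ref_unit @ cand_unit.T


def pooled_recall(ref, cand, p=None):
    """Average, over reference frames, of a pooled similarity to the candidate.

    p=None uses the max (as in BERTScore-style greedy matching); otherwise
    a power mean of order p is used, which is an assumed form of the p-norm.
    """
    sim = np.clip(cosine_matrix(ref, cand), 0.0, None)
    if p is None:
        per_frame = sim.max(axis=1)
    else:
        per_frame = np.mean(sim ** p, axis=1) ** (1.0 / p)
    return float(per_frame.mean())


# Three orthogonal toy "frames" compared against themselves.
frames = np.eye(3)
greedy = pooled_recall(frames, frames)
spread = pooled_recall(frames, frames, p=2)
```

The max only rewards the single best-matching frame, whereas a finite p lets several moderately similar frames contribute, which is one reading of the "non-local" motivation in the abstract.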

Abstract:
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo‑piano transcriptions. After first pretraining on approximately 60,000 hours of music, we use a comparatively smaller, high‑quality subset, to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general‑purpose contrastive MIDI embeddings by adapting the SimCLR framework to symbolic music. When evaluating piano continuation coherence, our generative model outperforms leading symbolic generation techniques and remains competitive with proprietary audio generation models. On MIR classification benchmarks, frozen representations from our contrastive model achieve state‑of‑the‑art results in linear probe experiments, while direct finetuning demonstrates the generalizability of pretrained representations, often requiring only a few hundred labeled examples to specialize to downstream tasks.
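
The SimCLR framework trains embeddings with a normalized temperature-scaled cross-entropy loss over pairs of augmented views. The NumPy sketch below shows that loss in isolation; it is not the authors' training code, the temperature is an arbitrary choice, and random or toy vectors stand in for embeddings of augmented MIDI excerpts.

```python
import numpy as np


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """SimCLR-style contrastive loss for two batches of paired embeddings.

    Row i of `view_a` and row i of `view_b` are treated as a positive pair;
    every other row in the combined batch acts as a negative.
    """
    n = len(view_a)
    z = np.concatenate([view_a, view_b]).astype(float)
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)               # a view is never its own negative
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_norm - sim[np.arange(2 * n), targets]))


# Toy check: correctly paired views score better than mismatched ones.
base = np.eye(3)
aligned = nt_xent_loss(base, base)
mismatched = nt_xent_loss(base, np.roll(base, 1, axis=0))
```

The loss drops when the two views of the same excerpt are closer to each other than to any other item in the batch.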

Abstract:
In text‑to‑audio (TTA) research, the relevance between input text and output audio is an important evaluation aspect. Traditionally, it has been evaluated from both subjective and objective perspectives. However, subjective evaluation is costly in terms of money and time, and objective evaluation is unclear regarding the correlation to subjective evaluation scores. In this study, we construct RELATE, an open‑sourced dataset that subjectively evaluates the relevance. Also, we benchmark a model for automatically predicting the subjective evaluation score from synthesized audio. Our model outperforms a conventional CLAPScore model, and that trend extends to many sound categories.

Abstract:
Contrastive language‑audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which utilizes the similarity of CLAP embeddings, has been a major metric for the evaluation of the relevance between audio and text in text‑to‑audio. However, the relationship between CLAPScore and human subjective evaluation scores is still unclear. We show that CLAPScore has a low correlation with human subjective evaluation scores. Additionally, we propose a human‑perception‑based CLAP called Human‑CLAP by training a contrastive language‑audio model using the subjective evaluation score. In our experiments, the results indicate that our Human‑CLAP improved the Spearman's rank correlation coefficient (SRCC) between the CLAPScore and the subjective evaluation scores by more than 0.25 compared with the conventional CLAP.
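
The two quantities discussed above can be sketched in a few lines: a CLAPScore-style cosine similarity between an audio embedding and a text embedding, and the Spearman rank correlation between such scores and human ratings. No CLAP model is loaded here; the embeddings and ratings are placeholders, and the rank computation assumes there are no tied values.

```python
import numpy as np


def cosine_score(audio_emb, text_emb):
    """Cosine similarity between one audio and one text embedding."""
    audio_emb = np.asarray(audio_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    denom = np.linalg.norm(audio_emb) * np.linalg.norm(text_emb)
    return float(audio_emb @ text_emb / denom)


def spearman_rank_correlation(x, y):
    """SRCC as the Pearson correlation of ranks (assumes no tied values)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Placeholder metric scores and human ratings (illustrative values only).
metric_scores = [0.10, 0.35, 0.20, 0.80]
human_ratings = [1.5, 3.0, 2.0, 4.5]
srcc = spearman_rank_correlation(metric_scores, human_ratings)
```

Because SRCC depends only on rank order, it measures whether the metric sorts clips the same way listeners do, regardless of the metric's scale.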

Abstract:
While end‑to‑end video‑to‑audio generation has greatly improved, producing high‑fidelity audio that authentically captures the nuances of visual content remains challenging. As with the work of professionals in the creative industries, such generation requires sophisticated reasoning about items such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain‑of‑Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object‑centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state‑of‑the‑art performance in video‑to‑audio generation across both audio metrics and CoT metrics, and excels in the out‑of‑distribution Movie Gen Audio benchmark. The project page is available at https://ThinkSound‑Project.github.io.

Abstract:
We propose Kling‑Foley, a large‑scale multimodal Video‑to‑Audio generation model that synthesizes high‑quality audio synchronized with video content. In Kling‑Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine it with a visual semantic representation module and an audio‑visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio‑visual synchronization. Together with text conditions, this integrated approach enables precise generation of video‑matching sound effects. In addition, we propose a universal latent audio codec that can achieve high‑quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, in order to make up for the incomplete types and annotations of the open‑source benchmark, we also open‑source an industrial‑level benchmark Kling‑Audio‑Eval. Our experiments show that Kling‑Foley trained with the flow matching objective achieves new audio‑visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment and audio quality.

Abstract:
Generative AI promises to allow people to create high‑quality personalized media. Although powerful, we identify three fundamental design problems with existing tooling through a literature review. We introduce a multimodal generative AI tool, DeckFlow, to address these problems. First, DeckFlow supports task decomposition by allowing users to maintain multiple interconnected subtasks on an infinite canvas populated by cards connected through visual dataflow affordances. Second, DeckFlow supports a specification decomposition workflow where an initial goal is iteratively decomposed into smaller parts and combined using feature labels and clusters. Finally, DeckFlow supports generative space exploration by generating multiple prompt and output variations, presented in a grid, that can feed back recursively into the next design iteration. We evaluate DeckFlow for text‑to‑image generation against a state‑of‑practice conversational AI baseline for image generation tasks. We then add audio generation and investigate user behaviors in a more open‑ended creative setting with text, image, and audio outputs.

Abstract:
Text‑to‑audio diffusion models produce high‑quality and diverse music but many, if not most, of the SOTA models lack the fine‑grained, time‑varying controls essential for music production. ControlNet enables attaching external controls to a pre‑trained generative model by cloning and fine‑tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence. Our method offers greater flexibility and significantly lower memory usage, enabling more efficient training and deployment of independent controls. We conduct extensive objective and subjective evaluations and provide numerous audio examples on the accompanying website at https://lightlatentcontrol.github.io

Abstract:
We present SDialog, an MIT‑licensed open‑source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end‑to‑end framework for building and analyzing LLM‑based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona‑driven multi‑agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM‑as‑a‑judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed‑backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog‑centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

Abstract:
Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities. We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross‑modal evaluation methodologies.

Abstract:
Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60‑second cinematic movies incorporating Stable Diffusion for high‑fidelity image synthesis, GPT‑2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube‑sourced music. It uses a five‑scene framework, which is augmented by linear frame interpolation, cinematic post‑processing (e.g., sharpening), and audio‑video synchronization to provide professional‑quality results. It was created in a GPU‑accelerated Google Colab environment using Python 3.11. It has a dual‑mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15‑30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text‑to‑video synthesis for creative, educational, and industrial applications.
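
The linear frame interpolation step mentioned above can be illustrated with a minimal sketch. It assumes frames are flat lists of pixel intensities of equal length; the function name `interpolate_frames` and this frame representation are hypothetical and are not taken from the described system.

```python
def interpolate_frames(frame_a, frame_b, n_between):
    """Return n_between frames linearly blended between frame_a and frame_b.

    Frames are flat lists of pixel intensities (an assumption of this sketch).
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    frames = []
    for k in range(1, n_between + 1):
        # alpha moves from just above 0 to just below 1, excluding both endpoints
        alpha = k / (n_between + 1)
        frames.append([(1 - alpha) * a + alpha * b
                       for a, b in zip(frame_a, frame_b)])
    return frames


# One in-between frame halfway between two tiny two-pixel "frames"
middle = interpolate_frames([0.0, 0.0], [10.0, 20.0], 1)
```

Inserting more in-between frames raises the effective frame rate without generating additional images, at the cost of ghosting when content moves quickly between keyframes.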

Abstract:
Generating accurate sounds for complex audio‑visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object‑aware audio generation model that grounds sound generation in user‑selected visual objects within images. Our method integrates object‑centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi‑modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test‑time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/

Abstract:
How does the textual representation of audio relate to what Large Language Models (LLMs) learn about the audio world? This research investigates the extent to which LLMs can be prompted to generate audio, despite their primary training in textual data. We employ a three‑tier approach, progressively increasing the complexity of audio generation: 1) Musical Notes, 2) Environmental Sounds, and 3) Human Speech. To bridge the gap between text and audio, we leverage code as an intermediary, prompting LLMs to generate code that, when executed, produces the desired audio output. To evaluate the quality and accuracy of the generated audio, we employ FAD and CLAP scores. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases. This suggests that while LLMs possess a latent understanding of the auditory world, their ability to translate this understanding into tangible audio output remains rudimentary. Further research into techniques that can enhance the quality and diversity of LLM‑generated audio can lead to an improvement in the performance of text‑based LLMs in generating audio.
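
To make the "code as an intermediary" idea concrete, the following is a minimal sketch of the kind of program an LLM might be asked to emit for the first tier (musical notes). It is an illustration only: the function names, the 16 kHz sample rate, and the choice of a plain sine tone are assumptions of this sketch, not details taken from the paper. It uses only the Python standard library.

```python
import math
import struct
import wave


def note_frequency(midi_note: int) -> float:
    """Equal-tempered frequency in Hz, with A4 (MIDI note 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def synth_note(midi_note: int, seconds: float = 0.5, rate: int = 16000) -> bytes:
    """Render one sine-wave note as 16-bit little-endian mono PCM bytes."""
    freq = note_frequency(midi_note)
    n_samples = int(seconds * rate)
    amplitude = 0.5 * 32767  # half of full scale to leave headroom
    return b"".join(
        struct.pack("<h", int(amplitude * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n_samples)
    )


def write_wav(path: str, pcm: bytes, rate: int = 16000) -> None:
    """Write mono 16-bit PCM bytes to a WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(rate)
        f.writeframes(pcm)


# Middle C (MIDI note 60) for half a second; call write_wav("c4.wav", pcm) to save it
pcm = synth_note(60)
```

A single pitched tone is easy to express this way, whereas environmental sounds and speech have no comparably compact closed‑form description, which is consistent with the reported drop in quality at the higher tiers.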

Abstract:
Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight‑line ordinary differential equation (ODE) paths. However, this approach requires training a flow‑matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address the limitations of rectified flow while leveraging the advantages of advanced pre‑trained diffusion models, this study integrates pre‑trained models with the rectified diffusion method to improve the efficiency of text‑to‑audio (TTA) generation. Specifically, we propose AudioTurbo, which learns first‑order ODE paths from deterministic noise sample pairs generated by a pre‑trained TTA model. Experiments on the AudioCaps dataset demonstrate that our model, with only 10 sampling steps, outperforms prior models and reduces inference to 3 steps compared to a flow‑matching‑based acceleration model.
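
Why straight‑line ODE paths permit very few sampling steps can be seen in a toy sketch. The velocity fields below are hand‑written stand‑ins, not learned networks, and nothing here reproduces AudioTurbo itself; `euler_sample`, `straight_velocity`, and `curved_velocity` are hypothetical names introduced for illustration.

```python
def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_velocity(x0, x1):
    """Constant velocity along the straight path x_t = (1 - t) * x0 + t * x1."""
    direction = [b - a for a, b in zip(x0, x1)]
    return lambda x, t: direction


def curved_velocity(x0, x1):
    """Time-varying velocity along the path x_t = x0 + t**2 * (x1 - x0)."""
    direction = [b - a for a, b in zip(x0, x1)]
    return lambda x, t: [2.0 * t * d for d in direction]


noise, data = [0.0, 0.0], [1.0, -2.0]
one_step_straight = euler_sample(straight_velocity(noise, data), noise, steps=1)
two_step_curved = euler_sample(curved_velocity(noise, data), noise, steps=2)
```

With the constant velocity, a single Euler step lands on the target, while the time‑varying field still falls short after two steps even though both paths end at the same point in continuous time. This is the intuition behind preferring first‑order (straight) paths when the step budget is small.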

Abstract:
Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real‑world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large‑scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate the generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre‑trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state‑of‑the‑art systems from speech and singing domains.

Abstract:
Diffusion models have been successfully applied in areas such as image, video, and audio generation. Recent works show their promise for sequential decision‑making and dexterous manipulation, leveraging their ability to model complex action distributions. However, challenges persist due to the data limitations and scenario‑specific adaptation needs. In this paper, we address these challenges by proposing an optimized approach to training diffusion policies using large, pre‑built datasets that are enhanced using Reinforcement Learning (RL). Our end‑to‑end pipeline leverages RL‑based enhancement of the DexGraspNet dataset, lightweight diffusion policy training on a dexterous manipulation task for a five‑fingered robotic hand, and a pose sampling algorithm for validation. The pipeline achieved a high success rate of 80% for three DexGraspNet objects. By eliminating manual data collection, our approach lowers barriers to adopting diffusion models in robotics, enhancing generalization and robustness for real‑world applications.

Abstract:
Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human‑aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state‑of‑the‑art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.

Abstract:
The rapid advancement of artificial intelligence (AI) has enabled sophisticated audio generation and voice cloning technologies, posing significant security risks for applications reliant on voice authentication. While existing datasets and models primarily focus on distinguishing between human and fully synthetic speech, real‑world attacks often involve audio that combines both genuine and cloned segments. To address this gap, we construct a novel hybrid audio dataset incorporating human, AI‑generated, cloned, and mixed audio samples. We further propose fine‑tuned Audio Spectrogram Transformer (AST)‑based models tailored for detecting these complex acoustic patterns. Extensive experiments demonstrate that our approach significantly outperforms existing baselines in mixed‑audio detection, achieving 97% classification accuracy. Our findings highlight the importance of hybrid datasets and tailored models in advancing the robustness of speech‑based authentication systems.

Abstract:
Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between each modality are established as core tasks of music information retrieval, such as automatic music transcription (audio‑to‑MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. In this paper, we propose a unified approach, where we train a general‑purpose model on many translation tasks simultaneously. Two key factors make this unified approach viable: a new large‑scale dataset and the tokenization of each modality. Firstly, we propose a new dataset that consists of more than 1,300 hours of paired audio‑score image data collected from YouTube videos, which is an order of magnitude larger than any existing music modal translation datasets. Secondly, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into a sequence of tokens, enabling a single encoder‑decoder Transformer to tackle multiple cross‑modal translation as one coherent sequence‑to‑sequence task. Experimental results confirm that our unified multitask model improves upon single‑task baselines in several key areas, notably reducing the symbol error rate for optical music recognition from 24.58% to a state‑of‑the‑art 13.67%, while similarly substantial improvements are observed across the other translation tasks. Notably, our approach achieves the first successful score‑image‑conditioned audio generation, marking a significant breakthrough in cross‑modal music generation.

Abstract:
We present Text2midi‑InferAlign, a novel technique for improving symbolic music generation at inference time. Our method leverages text‑to‑audio alignment and music structural alignment rewards during inference to encourage the generated music to be consistent with the input caption. Specifically, we introduce two objective scores: a text‑audio consistency score that measures rhythmic alignment between the generated music and the original text caption, and a harmonic consistency score that penalizes generated music containing notes inconsistent with the key. By optimizing these alignment‑based objectives during the generation process, our model produces symbolic music that is more closely tied to the input captions, thereby improving the overall quality and coherence of the generated compositions. Our approach can extend any existing autoregressive model without requiring further training or fine‑tuning. We evaluate our work on top of Text2midi, an existing text‑to‑midi generation model, demonstrating significant improvements in both objective and subjective evaluation metrics.
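
A harmonic consistency score of the kind described above could, under simple assumptions, be the fraction of notes whose pitch class belongs to the scale of the target key. The sketch below takes that reading; the abstract does not give the exact definition, so the function `in_key_ratio` and the restriction to major and natural minor scales are assumptions of this example.

```python
# Semitone offsets from the tonic for two common scales
SCALES = {
    "major": {0, 2, 4, 5, 7, 9, 11},
    "minor": {0, 2, 3, 5, 7, 8, 10},  # natural minor
}


def in_key_ratio(midi_pitches, tonic_pitch_class, mode="major"):
    """Fraction of notes whose pitch class lies in the given key.

    midi_pitches: iterable of MIDI note numbers.
    tonic_pitch_class: 0 for C, 1 for C#, ..., 11 for B.
    """
    pitches = list(midi_pitches)
    if not pitches:
        return 1.0  # an empty sequence contains no out-of-key notes
    allowed = SCALES[mode]
    in_key = sum(((p - tonic_pitch_class) % 12) in allowed for p in pitches)
    return in_key / len(pitches)


# A C major scale is fully in key; adding C# introduces one out-of-key note
c_major_scale = [60, 62, 64, 65, 67, 69, 71]
score_clean = in_key_ratio(c_major_scale, tonic_pitch_class=0)
score_mixed = in_key_ratio([60, 61], tonic_pitch_class=0)
```

Used as an inference‑time reward, a score like this can rank candidate continuations so that those with fewer out‑of‑key notes are preferred, without retraining the underlying model.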

Abstract:
Text‑to‑audio (T2A) generation has achieved remarkable progress in generating a variety of audio outputs from language prompts. However, current state‑of‑the‑art T2A models still struggle to satisfy human preferences for prompt‑following and acoustic quality when generating complex multi‑event audio. To improve the performance of the model in these high‑level applications, we propose to enhance the basic capabilities of the model with AI feedback learning. First, we introduce fine‑grained AI audio scoring pipelines to: 1) verify whether each event in the text prompt is present in the audio (Event Occurrence Score), 2) detect deviations in event sequences from the language description (Event Sequence Score), and 3) assess the overall acoustic and harmonic quality of the generated audio (Acoustic & Harmonic Quality). We evaluate these three automatic scoring pipelines and find that they correlate significantly better with human preferences than other evaluation metrics. This highlights their value as both feedback signals and evaluation metrics. Utilizing our robust scoring pipelines, we construct a large audio preference dataset, T2A‑FeedBack, which contains 41k prompts and 249k audios, each accompanied by detailed scores. Moreover, we introduce T2A‑EpicBench, a benchmark that focuses on long captions, multi‑events, and story‑telling scenarios, aiming to evaluate the advanced capabilities of T2A models. Finally, we demonstrate how T2A‑FeedBack can enhance a current state‑of‑the‑art audio model. With simple preference tuning, the audio generation model exhibits significant improvements in both simple (AudioCaps test set) and complex (T2A‑EpicBench) scenarios.

Abstract:
In recent years, generative adversarial networks (GANs) have made significant progress in generating audio sequences. However, these models typically rely on bandwidth‑limited mel‑spectrograms, which constrain the resolution of generated audio sequences, and lead to mode collapse during conditional generation. To address this issue, we propose Deformable Periodic Network based GAN (DPN‑GAN), a novel GAN architecture that incorporates a kernel‑based periodic ReLU activation function to induce periodic bias in audio generation. This innovative approach enhances the model's ability to capture and reproduce intricate audio patterns. In particular, our proposed model features a DPN module for multi‑resolution generation utilizing deformable convolution operations, allowing for adaptive receptive fields that improve the quality and fidelity of the synthetic audio. Additionally, we enhance the discriminator network using deformable convolution to better distinguish between real and generated samples, further refining the audio quality. We trained two versions of the model: DPN‑GAN small (38.67M parameters) and DPN‑GAN large (124M parameters). For evaluation, we use five different datasets, covering both speech synthesis and music generation tasks, to demonstrate the efficiency of the DPN‑GAN. The experimental results demonstrate that DPN‑GAN delivers superior performance on both out‑of‑distribution and noisy data, showcasing its robustness and adaptability. Trained across various datasets, DPN‑GAN outperforms state‑of‑the‑art GAN architectures on standard evaluation metrics, and exhibits increased robustness in synthesized audio.

Abstract:
Text‑to‑audio systems, while increasingly performant, are slow at inference time, making their latency impractical for many creative applications. We present Adversarial Relativistic‑Contrastive (ARC) post‑training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post‑training methods have struggled to compare against their expensive distillation counterparts, ARC post‑training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post‑training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post‑training with a number of optimizations to Stable Audio Open and build a model capable of generating ≈12s of 44.1kHz stereo audio in ≈75ms on an H100, and ≈7s on a mobile edge‑device, the fastest text‑to‑audio model to our knowledge.

Abstract:
Text‑to‑audio models have recently emerged as a powerful technology for generating sound from textual descriptions. However, their high computational demands raise concerns about energy consumption and environmental impact. In this paper, we conduct an analysis of the energy usage of 7 state‑of‑the‑art text‑to‑audio diffusion‑based generative models, evaluating to what extent variations in generation parameters affect energy consumption at inference time. We also aim to identify an optimal balance between audio quality and energy consumption by considering Pareto‑optimal solutions across all selected models. Our findings provide insights into the trade‑offs between performance and environmental impact, contributing to the development of more efficient generative audio models.
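
Selecting Pareto‑optimal configurations, as described above, can be sketched in a few lines. The configuration names and numbers below are made‑up placeholders for illustration, not measurements from the study, and the sketch assumes a quality score where higher is better and an energy cost where lower is better.

```python
def pareto_front(configs):
    """Return names of configurations not dominated by any other.

    configs: list of (name, energy, quality) tuples.
    A configuration is dominated if another one uses no more energy and has
    no lower quality, and is strictly better in at least one of the two.
    """
    front = []
    for name, energy, quality in configs:
        dominated = any(
            e2 <= energy and q2 >= quality and (e2 < energy or q2 > quality)
            for _, e2, q2 in configs
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values only: (name, energy per generation, quality score)
example = [
    ("a", 1.0, 0.5),
    ("b", 2.0, 0.8),
    ("c", 2.5, 0.7),  # dominated by "b": more energy, lower quality
    ("d", 3.0, 0.9),
]
front = pareto_front(example)
```

Configurations on the front represent the available trade‑offs: moving from one to another buys quality only at the price of additional energy. For a lower‑is‑better metric such as FAD, the quality comparisons would be reversed.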

Abstract:
Most widely‑used modern audio codecs, such as Ogg Vorbis and MP3, as well as more recent "neural" codecs like Meta's Encodec or the Descript Audio Codec are based on block‑coding; audio is divided into overlapping, fixed‑size "frames" which are then compressed. While they often yield excellent reproductions and can be used for downstream tasks such as text‑to‑audio, they do not produce an intuitive, directly‑interpretable representation. In this work, we introduce a proof‑of‑concept audio encoder that represents audio as a sparse set of events and their times‑of‑occurrence. Rudimentary physics‑based assumptions are used to model attack and the physical resonance of both the instrument being played and the room in which a performance occurs, hopefully encouraging a sparse, parsimonious, and easy‑to‑interpret representation.

Abstract:
In this work, we address the task of voice conversion (VC) using a vector‑based interface. To align audio embeddings across speakers, we employ discrete optimal transport (OT) and approximate the transport map using the barycentric projection. Our evaluation demonstrates that this approach yields high‑quality and effective voice conversion. We also perform an ablation study on the number of embeddings used, extending previous work on simple averaging of kNN and OT results. Additionally, we show that applying discrete OT as a post‑processing step in audio generation can cause synthetic speech to be misclassified as real, revealing a novel and strong adversarial attack.
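
The barycentric projection of a transport plan can be sketched as follows. For self‑containment this example swaps in an entropic (Sinkhorn) approximation of the plan with uniform weights rather than an exact discrete OT solver, so it is not the paper's pipeline. The two‑dimensional "embeddings", the regularization value, and all function names are assumptions made for illustration.

```python
import math


def sinkhorn_plan(cost, reg=0.05, iters=200):
    """Entropic-regularized OT plan between two uniform discrete distributions."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling so row and column sums approach the marginals a and b
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def barycentric_projection(plan, targets):
    """Map each source point to the plan-weighted average of the target vectors."""
    dim = len(targets[0])
    mapped = []
    for row in plan:
        mass = sum(row)
        mapped.append([
            sum(row[j] * targets[j][d] for j in range(len(targets))) / mass
            for d in range(dim)
        ])
    return mapped


def squared_distance(x, y):
    return sum((p - q) ** 2 for p, q in zip(x, y))


# Toy "embeddings" for a source and a target speaker
source = [[0.0, 0.0], [1.0, 0.0]]
target = [[0.0, 1.0], [1.0, 1.0]]
cost = [[squared_distance(s, t) for t in target] for s in source]
converted = barycentric_projection(sinkhorn_plan(cost), target)
```

In this toy case each source vector is mapped almost exactly onto its nearest target vector. With real embeddings the plan spreads mass over several targets, and the projection returns their weighted mean.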

Abstract:
Score‑based Generative Models (SGMs) approximate a data distribution by perturbing it with Gaussian noise and subsequently denoising it via a learned reverse diffusion process. These models excel at modeling complex data distributions and generating diverse samples, achieving state‑of‑the‑art performance across domains such as computer vision, audio generation, reinforcement learning, and computational biology. Despite their empirical success, existing Wasserstein‑2 convergence analyses typically assume strong regularity conditions, such as smoothness or strict log‑concavity of the data distribution, that are rarely satisfied in practice. In this work, we establish the first non‑asymptotic Wasserstein‑2 convergence guarantees for SGMs targeting semiconvex distributions with potentially discontinuous gradients. Our upper bounds are explicit and sharp in key parameters, achieving optimal dependence of O(\sqrt{d}) on the data dimension d and convergence rate of order one. The framework accommodates a wide class of practically relevant distributions, including symmetric modified half‑normal distributions, Gaussian mixtures, double‑well potentials, and elastic net potentials. By leveraging semiconvexity without requiring smoothness assumptions on the potential such as differentiability, our results substantially broaden the theoretical foundations of SGMs, bridging the gap between empirical success and rigorous guarantees in non‑smooth, complex data regimes.

Abstract:
Audiobook generation aims to create rich, immersive listening experiences from multimodal inputs, but current approaches face three critical challenges: (1) the lack of synergistic generation of diverse audio types (e.g., speech, sound effects, and music) with precise temporal and semantic alignment; (2) the difficulty in conveying expressive, fine‑grained emotions, which often results in machine‑like vocal outputs; and (3) the absence of automated evaluation frameworks that align with human preferences for complex and diverse audio. To address these issues, we propose Dopamine Audiobook, a novel unified training‑free multi‑agent system, where a multimodal large language model (MLLM) serves two specialized roles (i.e., speech designer and audio designer) for emotional, human‑like, and immersive audiobook generation and evaluation. Specifically, we firstly propose a flow‑based, context‑aware framework for diverse audio generation with word‑level semantic and temporal alignment. To enhance expressiveness, we then design word‑level paralinguistic augmentation, utterance‑level prosody retrieval, and adaptive TTS model selection. Finally, for evaluation, we introduce a novel MLLM‑based evaluation framework incorporating self‑critique, perspective‑taking, and psychological MagicEmo prompts to ensure human‑aligned and self‑aligned assessments. Experimental results demonstrate that our method achieves state‑of‑the‑art (SOTA) performance on multiple metrics. Importantly, our evaluation framework shows better alignment with human preferences and transferability across audio tasks.

Abstract:
Recently, neural speech codecs (NSCs) trained as generative models have shown superior performance compared to conventional codecs at low bitrates. Although most state‑of‑the‑art NSCs are trained as Generative Adversarial Networks (GANs), Diffusion Models (DMs), a recent class of generative models, represent a promising alternative due to their superior performance in image generation relative to GANs. Consequently, DMs have been successfully applied for audio and speech coding among various other audio generation applications. However, the design of diffusion‑based NSCs has not yet been explored in a systematic way. We address this by providing a comprehensive analysis of diffusion‑based NSCs divided into three contributions. First, we propose a categorization based on the conditioning and output domains of the DM. This simple conceptual framework allows us to define a design space for diffusion‑based NSCs and to assign a category to existing approaches in the literature. Second, we systematically investigate unexplored designs by creating and evaluating new diffusion‑based NSCs within the conceptual framework. Finally, we compare the proposed models to existing GAN and DM baselines through objective metrics and subjective listening tests.

Abstract:
Foley, a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advancements in automatic video‑to‑audio models have increased the demand for greater user control in the process. One possible approach is to incorporate text to guide audio generation. While supported by existing methods, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist), a video‑and‑text‑to‑audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text‑to‑audio model and integrates video information through a modality adapter mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the expected video contextual cues. Experiments show that, besides its superior quality in terms of semantic alignment and audio‑visual synchronization, the proposed method enables high textual controllability, as demonstrated in subjective and objective evaluations.

Abstract:
The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single‑type audio deepfake detection (ADD), their performance declines in cross‑type scenarios. This paper is dedicated to studying the all‑type ADD task. We are the first to comprehensively establish an all‑type ADD benchmark to evaluate current CMs, incorporating cross‑type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self‑supervised learning (PT‑SSL) training paradigm, which optimizes the SSL front‑end by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine‑tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)‑SSL method to capture type‑invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all‑type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co‑training. Experimental results demonstrate that WPT‑XLSR‑AASIST achieved the best performance, with an average EER of 3.58% across all evaluation sets.
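
For readers unfamiliar with the equal error rate (EER) reported above, the following sketch shows one simple way to estimate it from detector scores. It assumes higher scores mean "more likely bona fide" and approximates the EER at the observed threshold where the false acceptance and false rejection rates are closest; the scores are illustrative placeholders, and evaluation toolkits typically interpolate between thresholds instead.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER over the observed score thresholds.

    A sample is accepted as bona fide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for th in thresholds:
        # False rejection: bona fide audio scored below the threshold
        frr = sum(s < th for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Placeholder scores: one bona fide sample and one spoof sample overlap
eer = equal_error_rate([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.7])
```

Because the EER is a single operating point where both error types are equal, it allows systems to be compared without fixing a decision threshold in advance.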

Abstract:
This paper investigates the design of effective prompt strategies for generating realistic datasets using Text‑To‑Audio (TTA) models. We also analyze different techniques for efficiently combining these datasets to enhance their utility in sound classification tasks. By evaluating two sound classification datasets with two TTA models, we apply a range of prompt strategies. Our findings reveal that task‑specific prompt strategies significantly outperform basic prompt approaches in data generation. Furthermore, merging datasets generated using different TTA models proves to enhance classification performance more effectively than merely increasing the training dataset size. Overall, our results underscore the advantages of these methods as effective data augmentation techniques using synthetic data.

Abstract:
Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra‑low bitrate discrete tokens of 0.23 kbps, allowing for seamless integration with text tokens in LLMs. We fine‑tuned a pretrained text‑based LLM using Low‑Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ‑VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine‑grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves competitive results in audio comprehension with state‑of‑the‑art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.

Abstract:
Currently, high‑quality, synchronized audio is synthesized using various multi‑modal joint learning frameworks, leveraging video and optional text inputs. In the video‑to‑audio benchmarks, video‑to‑audio quality, semantic alignment, and audio‑visual synchronization are effectively achieved. However, in real‑world scenarios, speech and audio often coexist in videos simultaneously, and the end‑to‑end generation of synchronous speech and audio given video and text conditions is not well studied. Therefore, we propose an end‑to‑end multi‑modal generation framework that simultaneously produces speech and audio based on video and text conditions. Furthermore, the advantages of video‑to‑audio (V2A) models for generating speech from videos remain unclear. The proposed framework, DeepAudio, consists of a video‑to‑audio (V2A) module, a text‑to‑speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module. In the evaluation, the proposed end‑to‑end framework achieves state‑of‑the‑art performance on the video‑audio benchmark, video‑speech benchmark, and text‑speech benchmark. In detail, our framework achieves results comparable to state‑of‑the‑art models on the video‑audio and text‑speech benchmarks, and surpasses state‑of‑the‑art models on the video‑speech benchmark, with WER 16.57% to 3.15% (+80.99%), SPK‑SIM 78.30% to 89.38% (+14.15%), EMO‑SIM 66.24% to 75.56% (+14.07%), MCD 8.59 to 7.98 (+7.10%), MCD SL 11.05 to 9.40 (+14.93%) across a variety of dubbing settings.
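
The parenthesized percentages above are consistent with relative changes computed against the baseline value, with the sign flipped for lower‑is‑better metrics so that every improvement is positive. The sketch below reproduces them from the reported before/after pairs. The metric direction flags are this example's assumption (WER and MCD lower is better, the similarity scores higher is better), and the helper name is hypothetical.

```python
def relative_improvement_pct(baseline, proposed, lower_is_better):
    """Relative improvement over the baseline, in percent (positive = better)."""
    delta = (baseline - proposed) if lower_is_better else (proposed - baseline)
    return 100.0 * delta / baseline


# (metric, baseline, proposed, lower_is_better), values as reported in the abstract
reported = [
    ("WER", 16.57, 3.15, True),
    ("SPK-SIM", 78.30, 89.38, False),
    ("EMO-SIM", 66.24, 75.56, False),
    ("MCD", 8.59, 7.98, True),
    ("MCD SL", 11.05, 9.40, True),
]

improvements = {
    name: round(relative_improvement_pct(base, new, lower), 2)
    for name, base, new, lower in reported
}
```

For example, the WER entry works out as (16.57 - 3.15) / 16.57, or about 80.99%, which is a relative reduction rather than a drop of 80.99 percentage points.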

Abstract:
Currently, high‑quality, synchronized audio is synthesized from video and optional text inputs using various multi‑modal joint learning frameworks. However, the precise alignment between the visual and generated audio domains remains far from satisfactory. One key factor is the lack of sufficient temporal and semantic alignment annotations in open‑source video‑audio and text‑audio benchmarks. Therefore, we propose a framework for audio generation from videos, leveraging the internal chain‑of‑thought (CoT) of a multi‑modal large language model (MLLM) to enable step‑by‑step reasoning without requiring additional annotations. Additionally, a corresponding multi‑modal reasoning dataset is constructed to facilitate the learning of initial reasoning in audio generation. In the experiments, we demonstrate the effectiveness of the proposed framework in reducing misalignment (voice‑over) in generated audio and achieving competitive performance compared to various state‑of‑the‑art models. The evaluation results show that the proposed method outperforms state‑of‑the‑art approaches across multiple metrics. Specifically, the FD_PaSST indicator is reduced by up to 10.07%, the FD_PANNs indicator by up to 11.62%, and the FD_VGG indicator by up to 38.61%. Furthermore, the IS indicator improves by up to 4.95%, the IB‑score indicator increases by up to 6.39%, and the DeSync indicator is reduced by up to 0.89%.

Abstract:
Creating high‑quality sound effects from videos and text prompts requires precise alignment between visual and audio domains, both semantically and temporally, along with step‑by‑step guidance for professional audio generation. However, current state‑of‑the‑art video‑guided audio generation models often fall short of producing high‑quality audio for both general and specialized use cases. To address this challenge, we introduce a multi‑stage, multi‑modal, end‑to‑end generative framework with Chain‑of‑Thought‑like (CoT‑like) guidance learning, termed Chain‑of‑Perform (CoP). First, we employ a transformer‑based network architecture designed to achieve CoP guidance, enabling the generation of both general and professional audio. Second, we implement a multi‑stage training framework that follows step‑by‑step guidance to ensure the generation of high‑quality sound effects. Third, we develop a CoP multi‑modal dataset, guided by video, to support step‑by‑step sound effects generation. Evaluation results highlight the advantages of the proposed multi‑stage CoP generative framework compared to the state‑of‑the‑art models on a variety of datasets, with FAD 0.79 to 0.74 (+6.33%), CLIP 16.12 to 17.70 (+9.80%) on VGGSound, SI‑SDR 1.98 dB to 3.35 dB (+69.19%), MOS 2.94 to 3.49 (+18.71%) on PianoYT‑2h, and SI‑SDR 2.22 dB to 3.21 dB (+44.59%), MOS 3.07 to 3.42 (+11.40%) on Piano‑10h.

Abstract:
Steganography embeds confidential data within seemingly innocuous communications. Provable security in steganography, a long‑sought goal, has become feasible with deep generative models. However, existing methods face a critical trade‑off between security and efficiency. This paper introduces SparSamp, an efficient provably secure steganography method based on sparse sampling. SparSamp embeds messages by combining them with pseudo‑random numbers to obtain message‑derived random numbers for sampling. It enhances extraction accuracy and embedding capacity by increasing the sampling intervals and making the sampling process sparse. SparSamp preserves the original probability distribution of the generative model, thus ensuring security. It introduces only O(1) additional complexity per sampling step, enabling the fastest embedding speed without compromising generation speed. SparSamp is designed to be plug‑and‑play; message embedding can be achieved by simply replacing the sampling component of an existing generative model with SparSamp. We implemented SparSamp in text, image, and audio generation models. It can achieve embedding speeds of up to 755 bits/second with GPT‑2, 5,046 bits/second with DDPM, and 9,223 bits/second with WaveRNN.
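
One way to picture "combining a message with a pseudo‑random number" is inverse‑transform sampling driven by a shifted random number. The sketch below is an illustrative assumption, not the SparSamp algorithm itself: the shift formula, the function names, and the toy distribution are all hypothetical. It relies only on the fact that adding a message‑dependent offset to a uniform number modulo 1 leaves it uniform, which is why the sampled tokens still follow the model's distribution.

```python
import bisect
import itertools
import random


def message_shifted_random(message_bits: str, u: float) -> float:
    """Shift a shared pseudo-random number u in [0, 1) by the message value (assumed form)."""
    m = int(message_bits, 2)
    return (u + m / 2 ** len(message_bits)) % 1.0


def sample_token(probs: list[float], r: float) -> int:
    """Inverse-transform sampling: return the index whose CDF interval contains r."""
    cdf = list(itertools.accumulate(probs))
    return min(bisect.bisect_right(cdf, r), len(probs) - 1)


probs = [0.5, 0.3, 0.2]   # toy next-token distribution (hypothetical)
rng = random.Random(0)    # stands in for the shared pseudo-random generator
counts = [0, 0, 0]
for _ in range(20000):
    r = message_shifted_random("101", rng.random())
    counts[sample_token(probs, r)] += 1

# Empirical token frequencies stay close to probs despite the message shift.
freqs = [c / 20000 for c in counts]
```

Message extraction, which the abstract ties to sparse sampling intervals, is deliberately left out of this sketch.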

Abstract:
Recent works in cross‑modal understanding and generation, notably through models like CLAP (Contrastive Language‑Audio Pretraining) and CAVP (Contrastive Audio‑Visual Pretraining), have significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and inherent noises present in each modality, which can crucially impact the quality and efficacy of cross‑modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, our DiffGAP employs a bidirectional diffusion process tailored to bridge the cross‑modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings and vice versa, thus facilitating a more nuanced and robust cross‑modal interaction. Our experimental results on VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text‑audio generation and retrieval tasks, confirming its effectiveness in enhancing cross‑modal understanding and generation capabilities.

Abstract:
As artificial intelligence‑generated content (AIGC) continues to evolve, video‑to‑audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as current models often lose sequential context by relying solely on frame‑based features. To address this, we present TA‑V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion model‑based system utilizes automated text modulation to enhance inference quality and efficiency, providing personalized control through text‑guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video‑to‑audio generation.

Abstract:
Dynamic Magnetic Resonance Imaging (MRI) of the vocal tract has become an increasingly adopted imaging modality for speech motor studies. Beyond image signals, systematic data loss, noise pollution, and audio file corruption can occur due to the unpredictability of the MRI acquisition environment. In such cases, generating audio from images is critical for data recovery in both clinical and research applications. However, this remains challenging due to hardware constraints, acoustic interference, and data corruption. Existing solutions, such as denoising and multi‑stage synthesis methods, face limitations in audio fidelity and generalizability. To address these challenges, we propose a Knowledge Enhanced Conditional Variational Autoencoder (KE‑CVAE), a novel two‑step "knowledge enhancement + variational inference" framework for generating speech audio signals from cine dynamic MRI sequences. This approach introduces two key innovations: (1) integration of unlabeled MRI data for knowledge enhancement, and (2) a variational inference architecture to improve generative modeling capacity. To the best of our knowledge, this is one of the first attempts at synthesizing speech audio directly from dynamic MRI video sequences. The proposed method was trained and evaluated on an open‑source dynamic vocal tract MRI dataset recorded during speech. Experimental results demonstrate its effectiveness in generating natural speech waveforms while addressing MRI‑specific acoustic challenges, outperforming conventional deep learning‑based synthesis approaches.

Abstract:
Generative AI (GenAI) tools enhance social media video creation by streamlining tasks such as scriptwriting, visual and audio generation, and editing. These tools enable the creation of new content, including text, images, audio, and video, with platforms like ChatGPT and MidJourney becoming increasingly popular among YouTube creators. Despite their growing adoption, knowledge of their specific use cases across the video production process remains limited. This study analyzes 274 YouTube how‑to videos to explore GenAI's role in planning, production, editing, and uploading. The findings reveal that YouTubers use GenAI to identify topics, generate scripts, create prompts, and produce visual and audio materials. Additionally, GenAI supports editing tasks like upscaling visuals and reformatting content while also suggesting titles and subtitles. Based on these findings, we discuss future directions for incorporating GenAI to support various video creation tasks.

Abstract:
Text‑to‑audio (TTA), which generates audio signals from textual descriptions, has received huge attention in recent years. However, recent works have focused only on text‑to‑monaural audio. Spatial audio provides a more immersive auditory experience than monaural audio, e.g., in virtual reality. To address this issue, we propose a text‑to‑spatial‑audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) for extracting the latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model from the latent acoustic representations and text features for spatial audio generation. In the inference stage, only the text description is needed to generate spatial audio. In particular, to improve the synthesis quality and azimuth accuracy of the spatial sound events simultaneously, we propose to use two kinds of acoustic features. One is Mel spectrograms, which are good for improving the synthesis quality, and the other is short‑time Fourier transform spectrograms, which are good for improving the azimuth accuracy. We provide a pipeline for constructing a spatial audio dataset with text prompts, for the training of the VAEs and diffusion model. We also introduce new spatial‑aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.

Abstract:
Although being widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution‑free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open‑sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
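
For readers unfamiliar with Maximum Mean Discrepancy, the sketch below shows the standard unbiased estimator of squared MMD between two sample sets. The Gaussian RBF kernel, the fixed bandwidth `sigma`, and the random vectors standing in for audio embeddings are assumptions made for illustration; KAD's actual kernel, bandwidth selection, and embedding model are defined by the paper and the kadtk toolkit, not here.

```python
import math
import random


def rbf(x, y, sigma):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigma=2.0):
    """Unbiased estimate of squared MMD (diagonal terms excluded within each set)."""
    m, n = len(xs), len(ys)
    k_xx = sum(rbf(xs[i], xs[j], sigma) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(rbf(ys[i], ys[j], sigma) for i in range(n) for j in range(n) if i != j)
    k_xy = sum(rbf(x, y, sigma) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


def gaussian_set(rng, count, dim, mean):
    """Toy stand-in for a set of audio embeddings."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
reference = gaussian_set(rng, 60, 4, 0.0)
similar = gaussian_set(rng, 60, 4, 0.0)   # same distribution as the reference
shifted = gaussian_set(rng, 60, 4, 3.0)   # clearly different distribution

score_similar = mmd2_unbiased(reference, similar)
score_shifted = mmd2_unbiased(reference, shifted)
```

Because the estimator is unbiased, `score_similar` hovers around zero (and may be slightly negative), while `score_shifted` is clearly positive; no Gaussian fit to the embeddings is involved at any point.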

Abstract:
A diffusion probabilistic model (DPM) is a generative model renowned for its ability to produce high‑quality outputs in tasks such as image and audio generation. However, training DPMs on large, high‑dimensional datasets such as high‑resolution images or audio incurs significant computational, energy, and hardware costs. In this work, we introduce efficient quantum algorithms for implementing DPMs through various quantum ODE solvers. These algorithms highlight the potential of quantum Carleman linearization for diverse mathematical structures, leveraging state‑of‑the‑art quantum linear system solvers (QLSS) or linear combination of Hamiltonian simulations (LCHS). Specifically, we focus on two approaches: DPM‑solver‑k which employs exact k‑th order derivatives to compute a polynomial approximation of ε_θ(x_λ,λ); and UniPC which uses finite difference of ε_θ(x_λ,λ) at different points (x_s_m, λ_s_m) to approximate higher‑order derivatives. As such, this work represents one of the most direct and pragmatic applications of quantum algorithms to large‑scale machine learning models, presumably taking substantial steps towards demonstrating the practical utility of quantum computing.

Abstract:
In multimedia applications such as films and video games, spatial audio techniques are widely employed to enhance user experiences by simulating 3D sound: transforming mono audio into binaural formats. However, this process is often complex and labor‑intensive for sound designers, requiring precise synchronization of audio with the spatial positions of visual components. To address these challenges, we propose a visual‑based spatial audio generation system: an automated system that integrates YOLOv8‑based face detection, monocular depth estimation, and spatial audio techniques. Notably, the system operates without requiring additional binaural dataset training. The proposed system is evaluated against existing spatial audio generation systems using objective metrics. Experimental results demonstrate that our method significantly improves spatial consistency between audio and video, enhances speech quality, and performs robustly in multi‑speaker scenarios. By streamlining the audio‑visual alignment process, the proposed system enables sound engineers to achieve high‑quality results efficiently, making it a valuable tool for professionals in multimedia production.

Abstract:
Audio captioning systems face a fundamental challenge: teacher‑forcing training creates exposure bias that leads to caption degeneration during inference. While contrastive methods have been proposed as solutions, they typically fail to capture the crucial temporal relationships between acoustic and linguistic modalities. We address this limitation by introducing the unbiased sliced Wasserstein RBF (USW‑RBF) kernel with rotary positional embedding, specifically designed to preserve temporal information across modalities. Our approach offers a practical advantage: the kernel enables efficient stochastic gradient optimization, making it computationally feasible for real‑world applications. Building on this foundation, we develop a complete audio captioning framework that integrates stochastic decoding to further mitigate caption degeneration. Extensive experiments on the AudioCaps and Clotho datasets demonstrate that our method significantly improves caption quality, lexical diversity, and text‑to‑audio retrieval accuracy. Furthermore, we demonstrate the generalizability of our USW‑RBF kernel by applying it to audio reasoning tasks, where it enhances the reasoning capabilities of large audio language models on the CompA‑R benchmark in terms of correctness and quality. Our kernel also improves reasoning accuracy on the MMAU‑test‑mini benchmark by 4%. These results establish our approach as a powerful and generalizable solution for cross‑modal alignment challenges in audio‑language tasks.

Abstract:
This paper introduces Swap Forward (SaFa), a modality‑agnostic and efficient method to generate seamless and coherent long spectrum and panorama through latent swap joint diffusion across multi‑views. We first investigate the spectrum aliasing problem in spectrum‑based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representation of Mel‑spectra and RGB images, we identify that the failure arises from excessive suppression of high‑frequency components during the spectrum denoising process due to the averaging operator. To address this issue, we propose Self‑Loop Latent Swap, a frame‑level bidirectional swap applied to the overlapping region of adjacent views. Leveraging stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high‑frequency components and avoids spectrum distortion. Furthermore, to improve global cross‑view consistency in non‑overlapping regions, we introduce Reference‑Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining swap timing and intervals, we can achieve a cross‑view similarity‑diversity balance in a forward‑only manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training‑based methods in audio generation using both U‑Net and DiT models, along with effective adaptation to longer lengths. It also adapts well to panorama generation, achieving comparable performance with 2 to 20× faster speed and greater model generalizability. More generation demos are available at https://swapforward.github.io/
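
The contrast between averaging and swapping in an overlap region can be illustrated with a toy example. The sketch below is an assumption‑laden illustration rather than the SaFa operator: each latent frame is reduced to a single scalar, the two views are independent noise, and the "swap" simply exchanges every other frame between the views. It shows only that averaging independent trajectories shrinks their energy, whereas a swap keeps every frame at its original magnitude.

```python
import random


def average_overlap(view_a, view_b):
    """Joint-diffusion style merge: average the two views frame by frame."""
    return [(a + b) / 2.0 for a, b in zip(view_a, view_b)]


def frame_swap(view_a, view_b):
    """Exchange every other frame between the two overlapping segments (assumed pattern)."""
    new_a = [b if t % 2 else a for t, (a, b) in enumerate(zip(view_a, view_b))]
    new_b = [a if t % 2 else b for t, (a, b) in enumerate(zip(view_a, view_b))]
    return new_a, new_b


def energy(frames):
    """Mean squared value, used here as a crude proxy for retained detail."""
    return sum(x * x for x in frames) / len(frames)


rng = random.Random(0)
overlap_a = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # view A, overlap region
overlap_b = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # view B, overlap region

averaged = average_overlap(overlap_a, overlap_b)
swapped_a, swapped_b = frame_swap(overlap_a, overlap_b)

energy_averaged = energy(averaged)    # roughly half the energy of either view
energy_swapped = energy(swapped_a)    # roughly the energy of the original views
```

With independent unit‑variance views, the averaged segment retains about half the energy of the swapped one, which mirrors the suppression effect the abstract attributes to the averaging operator.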

Abstract:
With the rise of diffusion models, audio‑video generation has been revolutionized. However, most existing methods rely on separate modules for each modality, with limited exploration of unified generative architectures. In addition, many are confined to a single task and small‑scale datasets. To overcome these limitations, we introduce UniForm, a unified multi‑task diffusion transformer that generates both audio and visual modalities in a shared latent space. By using a unified denoising network, UniForm captures the inherent correlations between sound and vision. Additionally, we propose task‑specific noise schemes and task tokens, enabling the model to support multiple tasks with a single set of parameters, including video‑to‑audio, audio‑to‑video and text‑to‑audio‑video generation. Furthermore, by leveraging large language models and a large‑scale text‑audio‑video combined dataset, UniForm achieves greater generative diversity than prior approaches. Experiments show that UniForm achieves performance close to the state‑of‑the‑art single‑task models across three generation tasks, with generated content that is not only highly aligned with real‑world data distributions but also enables more diverse and fine‑grained generation.

Abstract:
Text‑to‑audio generation models (TAG) have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text‑to‑audio generation models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio token level. This method offers a detailed and comprehensive understanding of the relationship between text inputs and audio outputs, enhancing both the explainability and trustworthiness of TAG models. Extensive experiments demonstrate the effectiveness of AudioGenX in producing faithful explanations, benchmarked against existing methods using novel evaluation metrics specifically designed for audio generation tasks.

Abstract:
Text‑to‑Audio (TTA) generation is an emerging area within AI‑generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well‑labeled datasets and the prevalence of noisy or inaccurate captions in large‑scale, weakly labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio consists of two core components: AudioCapTeller and an audio generator. AudioCapTeller generates synthetic captions for audio and provides confidence scores to evaluate their accuracy. The audio generator uses these synthetic captions and confidence scores to enable quality‑aware audio generation. Additionally, we introduce a self‑evolving training strategy that iteratively optimizes CosyAudio across both well‑labeled and weakly‑labeled datasets. Initially trained with well‑labeled data, AudioCapTeller leverages its assessment capabilities on weakly‑labeled datasets for high‑quality filtering and reinforcement learning, which further improves its performance. The well‑trained AudioCapTeller refines corpora by generating new captions and confidence scores, serving for the audio generator training. Extensive experiments on open‑source datasets demonstrate that CosyAudio outperforms existing models in automated audio captioning, generates more faithful audio, and exhibits strong generalization across diverse scenarios.

Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, Shusen Zhang, Xin Wu, Shuai Zhao, Linchu Xiong, Yozhen Wu, Jiahui Ye, Wenhao Lu, Bowen Li, Yan Zhang, Yaqi Zhou, Xin Chen, Lei Su, Hongda Zhang, Fuzhong Chen, Xuezhen Dong, Na Nie, Zhiying Wu, Bin Xiao, Ting Li, Shunya Dang, Ping Zhang, Yijia Sun, Jincheng Wu, Jinjie Yang, Xionghai Lin, Zhi Ma, Kegeng Wu, Jia li, Aiyuan Yang, Hui Liu, Jianqiang Zhang, Xiaoxi Chen, Guangwei Ai, Wentao Zhang, Yicong Chen, Xiaoqin Huang, Kun Li, Wenjing Luo, Yifei Duan, Lingling Zhu, Ran Xiao, Zhe Su, Jiani Pu, Dian Wang, Xu Jia, Tianyu Zhang, Mengyu Ai, Mang Wang, Yujing Qiao, Lei Zhang, Yanjun Shen, Fan Yang, Miao Zhen, Yijie Zhou, Mingyang Chen, Fei Li, Chenzheng Zhu, Keer Lu, Yaqi Zhao, Hao Liang, Youquan Li, Yanzhao Qin, Linzhuang Sun, Jianhua Xu, Haoze Sun, Mingan Lin, Zenan Zhou, Weipeng Chen

Abstract:
We introduce Baichuan‑Omni‑1.5, an omni‑modal model that not only has omni‑modal understanding capabilities but also provides end‑to‑end audio generation capabilities. To achieve fluent and high‑quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high‑quality data (text, audio, and vision). Second, an audio‑tokenizer (Baichuan‑Audio‑Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi‑stage training strategy that progressively integrates multimodal alignment and multitask fine‑tuning, ensuring effective synergy across all modalities. Baichuan‑Omni‑1.5 leads contemporary models (including GPT4o‑mini and MiniCPM‑o 2.6) in terms of comprehensive omni‑modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2‑VL‑72B across various multimodal medical benchmarks.

Abstract:
Large Language Models (LLMs) demonstrate impressive zero‑shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. However, these advanced capabilities may also pose significant safety problems, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual modality inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of audio‑specific jailbreak on Large Audio‑Language Models (LALMs) remains largely underexplored. To address this gap, we introduce Jailbreak‑AudioBench, which consists of the Toolbox, curated Dataset, and comprehensive Benchmark. The Toolbox supports not only text‑to‑audio conversion but also various editing techniques for injecting audio hidden semantics. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state‑of‑the‑art LALMs and establish the most comprehensive Jailbreak benchmark to date for audio modality. Finally, Jailbreak‑AudioBench establishes a foundation for advancing future research on LALMs safety alignment by enabling the in‑depth exposure of more powerful jailbreak threats, such as query‑based audio editing, and by facilitating the development of effective defense mechanisms.

Abstract:
Head‑related transfer functions (HRTFs) with dense spatial grids are desired for immersive binaural audio generation, but their recording is time‑consuming. Although HRTF spatial upsampling has shown remarkable progress with neural fields, spatial upsampling only from a few measured directions, e.g., 3 or 5 measurements, is still challenging. To tackle this problem, we propose a retrieval‑augmented neural field (RANF). RANF retrieves a subject whose HRTFs are close to those of the target subject from a dataset. The HRTF of the retrieved subject at the desired direction is fed into the neural field in addition to the sound source direction itself. Furthermore, we present a neural network that can efficiently handle multiple retrieved subjects, inspired by a multi‑channel processing technique called transform‑average‑concatenate. Our experiments confirm the benefits of RANF on the SONICOM dataset, and it is a key component in the winning solution of Task 2 of the listener acoustic personalization challenge 2024.

Abstract:
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine‑grained spatial details. In this paper, we propose a new audio‑visual binaural generation model incorporating an audio‑visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost‑efficient way to utilise test‑time augmentation in video data to enhance performance. Our approach achieves state‑of‑the‑art generation accuracy on the FAIR‑Play and MUSIC‑Stereo benchmarks.

Abstract:
While transformers demonstrate outstanding performance across various audio tasks, their application to neural vocoders remains challenging. Neural vocoders require the generation of long audio signals at the sample level, which demands high temporal resolution. This results in significant computational costs for attention map generation and limits their ability to efficiently process both global and local information. Additionally, the sequential nature of sample generation in neural vocoders poses difficulties for real‑time processing, making the direct adoption of transformers impractical. To address these challenges, we propose RingFormer, a neural vocoder that incorporates the ring attention mechanism into a lightweight transformer variant, the convolution‑augmented transformer (Conformer). Ring attention effectively captures local details while integrating global information, making it well‑suited for processing long sequences and enabling real‑time audio generation. RingFormer is trained using adversarial training with two discriminators. The proposed model is applied to the decoder of the text‑to‑speech model VITS and compared with state‑of‑the‑art vocoders such as HiFi‑GAN, iSTFT‑Net, and BigVGAN under identical conditions using various objective and subjective metrics. Experimental results show that RingFormer achieves comparable or superior performance to existing models, particularly excelling in real‑time audio generation. Our code and audio samples are available on GitHub.

Abstract:
We introduce TangoFlux, an efficient Text‑to‑Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold‑standard answers available for Large Language Models (LLMs). To address this, we propose CLAP‑Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state‑of‑the‑art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.
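
The idea of ranking generated candidates to form preference data can be sketched as follows. This is a minimal illustration under stated assumptions: `build_preference_pair` and `toy_score` are hypothetical names, the choice of best‑versus‑worst candidate as the pair is one plausible reading of "CLAP‑ranked", and the word‑overlap scorer is only a stand‑in for a real CLAP text‑audio similarity model.

```python
def build_preference_pair(prompt, candidates, score_fn):
    """Rank candidates by score and pair the best with the worst."""
    ranked = sorted(candidates, key=lambda audio: score_fn(prompt, audio), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}


def toy_score(prompt, audio_tags):
    """Stand-in for CLAP similarity: count prompt words found in the clip's tags."""
    tags = set(audio_tags.split())
    return sum(1 for word in prompt.split() if word in tags)


# Strings stand in for generated audio clips, described by their content tags.
candidates = ["dog barking rain", "dog whining", "piano solo"]
pair = build_preference_pair("dog barking in rain", candidates, toy_score)
print(pair["chosen"], "|", pair["rejected"])
```

In an iterative setup such as the one the abstract describes, pairs like this would be regenerated from the current model at each round and fed to a preference‑optimization loss.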

Abstract:
Video‑to‑audio (V2A) generation utilizes visual‑only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine‑grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi‑modal conditions. To overcome these limitations, we introduce Tri‑Ergon, a diffusion‑based V2A model that incorporates textual, auditory, and pixel‑level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real‑world Foley workflows. Tri‑Ergon is capable of creating 44.1 kHz high‑fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state‑of‑the‑art V2A methods that typically generate mono audio for a fixed duration.

Abstract:
We present VoiceDiT, a multi‑modal generative model for producing environment‑aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large‑scale synthetic speech dataset for pre‑training and a refined real‑world speech dataset for fine‑tuning, (2) the Dual‑DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion‑based Image‑to‑Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi‑modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real‑world datasets, showcasing significant improvements in both audio quality and modality integration.

Abstract:
The video‑to‑audio (V2A) generation task has drawn attention in the field of multimedia due to its practicality in producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate sound events and their temporal occurrence. Recent studies on synthesizing immersive and synchronized audio face challenges on videos with moving visual presence: the temporal condition is not accurate enough, while a low‑resolution semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth‑Foley, a V2A generative model that takes semantic guidance from the textual label throughout generation to enhance both semantic and temporal alignment in audio. Two adapters are trained to leverage pre‑trained text‑to‑audio generation models. A frame adapter integrates high‑resolution frame‑wise video features, while a temporal adapter integrates temporal conditions obtained from similarities between visual frames and textual labels. The incorporation of semantic guidance from textual labels achieves precise audio‑video alignment. We conduct extensive quantitative and qualitative experiments. Results show that Smooth‑Foley performs better than existing models on both continuous sound scenarios and general scenarios. With semantic guidance, the audio generated by Smooth‑Foley exhibits higher quality and better adherence to physical laws.

Abstract:
Video and audio are closely correlated modalities that humans naturally perceive together. While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi‑modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses during inference and conditioning. In this paper, we introduce SyncFlow, a system that is capable of simultaneously generating temporally synchronized audio and video from text. The core of SyncFlow is the proposed dual‑diffusion‑transformer (d‑DiT) architecture, which enables joint video and audio modelling with proper information fusion. To efficiently manage the computational cost of joint audio and video modelling, SyncFlow utilizes a multi‑stage training strategy that separates video and audio learning before joint fine‑tuning. Our empirical evaluations demonstrate that SyncFlow produces audio and video outputs that are more correlated than baseline methods with significantly enhanced audio quality and audio‑visual correspondence. Moreover, we demonstrate strong zero‑shot capabilities of SyncFlow, including zero‑shot video‑to‑audio generation and adaptation to novel video resolutions without further training.

Abstract:
This work addresses the lack of multimodal generative models capable of producing high‑quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking the Spatially Aligned Audio‑Video Generation (SAVG) task. We introduce a spatially aligned audio‑visual dataset, whose audio and video data are curated based on whether sound events are onscreen or not. We also propose a new alignment metric that aims to evaluate the spatial alignment between audio and video. Then, using the dataset and metric, we benchmark two types of baseline methods: one is based on a joint audio‑video generation model, and the other is a two‑stage method that combines a video generation model and a video‑to‑audio generation model. Our experimental results demonstrate that gaps exist between the baseline methods and the ground truth in terms of video and audio quality, as well as spatial alignment between the two modalities.

Abstract:
Recent advances in audio generation have focused on text‑to‑audio (T2A) and video‑to‑audio (V2A) tasks. However, T2A or V2A methods cannot generate holistic sounds (onscreen and offscreen). This is because T2A cannot generate sounds aligning with onscreen objects, while V2A cannot generate semantically complete audio (offscreen sounds are missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with text and video. Previous approaches for joint text and video‑to‑audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow‑based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual‑Text Encoder and a Joint VT‑SiT model. To reduce modality bias and improve generation quality, we employ pretrained uni‑modal text‑to‑audio and video‑to‑audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe‑Bench, a dataset of 636 video‑text‑audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe‑Bench demonstrate that joint text and visual interaction is necessary for holistic audio generation. Furthermore, VinTAGe achieves state‑of‑the‑art results on the VGGSound benchmark. Our source code and pre‑trained models will be released. A demo is available at: https://www.youtube.com/watch?v=QmqWhUjPkJI.

Abstract:
Text‑to‑audio (TTA) models are capable of generating diverse audio from textual prompts. However, most mainstream TTA models, which predominantly rely on Mel‑spectrograms, still face challenges in producing audio with rich content. The intricate details and texture required in Mel‑spectrograms for such audio often surpass the models' capacity, leading to outputs that are blurred or lack coherence. In this paper, we begin by investigating the critical role of U‑Net in Mel‑spectrogram generation. Our analysis shows that in the U‑Net structure, high‑frequency components in skip‑connections and the backbone influence texture and detail, while low‑frequency components in the backbone are critical for the diffusion denoising process. We further propose "Mel‑Refine", a plug‑and‑play approach that enhances Mel‑spectrogram texture and detail by adjusting different component weights during inference. Our method requires no additional training or fine‑tuning and is fully compatible with any diffusion‑based TTA architecture. Experimental results show that our approach boosts performance metrics of the latest TTA model Tango2 by 25%, demonstrating its effectiveness.

Abstract:
We present Sketch2Sound, a generative audio model capable of creating high‑quality sounds from a set of interpretable time‑varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound‑shape). Sketch2Sound can be implemented on top of any text‑to‑audio latent diffusion transformer (DiT), and requires only 40k steps of fine‑tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketchlike sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal imitation while retaining the adherence to an input text prompt and audio quality compared to a text‑only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation. Sound examples are available at https://hugofloresgarcia.art/sketch2sound/.
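
The random median filtering mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a control signal is a plain list of per-frame values and that one window size is drawn at random per training example; the function names, the candidate window sizes, and the edge-padding choice are illustrative assumptions, not details taken from the paper.

```python
import random
import statistics


def median_filter(signal, window):
    """Apply a 1-D median filter with an odd window, repeating edge values."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not signal:
        return []
    half = window // 2
    # Pad by repeating the first and last values so the output keeps its length.
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [statistics.median(padded[i:i + window]) for i in range(len(signal))]


def randomly_smooth(signal, window_choices=(1, 3, 5, 9), rng=random):
    """Pick one window size at random (hypothetical choices) and filter with it."""
    window = rng.choice(window_choices)
    return median_filter(signal, window), window


# Hypothetical per-frame loudness curve extracted from a vocal imitation.
loudness = [0.0, 0.1, 0.9, 0.1, 0.2, 0.8, 0.7, 0.9, 0.1, 0.0]
smoothed, used_window = randomly_smooth(loudness, rng=random.Random(0))
```

A window of 1 leaves the control untouched, while larger windows keep only its coarse shape, which is one way to expose a model to controls of varying temporal specificity.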

Abstract:
Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low‑level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non‑autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real‑time and interactive generative applications.

Abstract:
To ensure the reliable operation of speech systems across diverse environments, noise addition methods have emerged as the standard solution. However, existing methods offer limited coverage of real‑world scenes and depend on pre‑existing noise libraries and scene metadata. This paper presents prompt‑based Dynamic Generative Scene‑based Noise Addition (DGSNA), a novel approach driven by generative language models that integrates Dynamic Generation of Scene‑based Information (DGSI) with Scene‑based Noise Addition for Speech (SNAS). The DGSI module, with a BET (Background, Examples, Task) prompt framework, dynamically generates logic‑compliant scene‑based information, including scene dimensions, sound sources, and microphone positions, thereby addressing the challenges of scene enumeration and detailed description. Complementing this, the SNAS module employs a Time‑Frequency Diffusion‑based (TFD) Text‑to‑Audio model to synthesize scene‑specific noise. By integrating this noise with clean speech via Room Impulse Response (RIR) filters, the module streamlines the traditionally labor‑intensive process of replicating diverse acoustic environments. Experimental results show that DGSNA significantly enhances the robustness of speech recognition and keyword spotting models, achieving relative improvements of up to 11.32%. Furthermore, DGSNA is highly compatible with existing noise addition techniques. Our implementation and demonstrations are available at https://dgsna.github.io.

Abstract:
This work focuses on improving Text‑To‑Audio (TTA) generation on zero‑shot and few‑shot settings (i.e. generating unseen or uncommon audio events). Inspired by the success of Retrieval‑Augmented Generation (RAG) in Large Language Models, we propose Audiobox TTA‑RAG, a novel retrieval‑augmented TTA approach based on Audiobox, a flow‑matching audio generation model. Unlike the vanilla Audiobox TTA solution that generates audio conditioned on text only, we extend the TTA process by augmenting the conditioning input with both text and retrieved audio samples. Our retrieval method does not require the external database to have labeled audio, offering more practical use cases. We show that the proposed model can effectively leverage the retrieved audio samples and significantly improve zero‑shot and few‑shot TTA performance, with large margins on multiple evaluation metrics, while maintaining the ability to generate semantically aligned audio for the in‑domain setting.

Abstract:
This paper investigates the capabilities of text‑to‑audio music generation models in producing long‑form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role‑Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text‑to‑music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM‑based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.

Abstract:
This paper presents the NPU‑HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single‑Codec to tokenize the speech into discrete tokens and use a language‑model‑based approach to achieve zero‑shot speaking style cloning. The Single‑Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel‑spectrograms to high‑fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene‑appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with the speech generated by our Track 1 system. Our submission achieves the second place and the first place in Track 1 and Track 2 respectively.

Abstract:
Despite significant advancements in neural text‑to‑audio generation, challenges persist in controllability and evaluation. This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024. We present an evaluation protocol combining an objective metric, namely Fréchet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation. Our analysis reveals varying performance across sound categories and model architectures, with larger models generally excelling but innovative lightweight approaches also showing promise. The strong correlation between objective metrics and human ratings validates our evaluation approach. We discuss outcomes in terms of audio quality, controllability, and architectural considerations for text‑to‑audio synthesizers, providing direction for future research.

Abstract:
Open‑vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio‑text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments on various benchmarks to show that existing ALMs struggle to generalize to linguistic variations in textual queries. To address this issue, we propose RobustCLAP, a novel and compute‑efficient technique to learn audio‑language representations agnostic to linguistic variations. Specifically, we reformulate the contrastive loss used in CLAP architectures by introducing a multi‑view contrastive learning objective, where paraphrases are treated as different views of the same audio scene and use this for training. Our proposed approach improves the text‑to‑audio retrieval performance of CLAP by 0.8%‑13% across benchmarks and enhances robustness to linguistic variation.

Authors: Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, Yuming Du

Abstract:
We present Movie Gen, a cast of foundation models that generates high‑quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction‑based video editing and generation of personalized videos based on a user's image. Our models set a new state‑of‑the‑art on multiple tasks: text‑to‑video synthesis, video personalization, video editing, video‑to‑audio generation, and text‑to‑audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames‑per‑second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre‑training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.

Abstract:
Spatial audio is a crucial component in creating immersive experiences. Traditional simulation‑based approaches to generate spatial audio rely on expertise, have limited scalability, and assume independence between semantic and spatial information. To address these issues, we explore end‑to‑end spatial audio generation. We introduce and formulate a new task of generating first‑order Ambisonics (FOA) given a sound category and sound source spatial location. We propose Diff‑SAGe, an end‑to‑end, flow‑based diffusion‑transformer model for this task. Diff‑SAGe utilizes a complex spectrogram representation for FOA, preserving the phase information crucial for accurate spatial cues. Additionally, a multi‑conditional encoder integrates the input conditions into a unified representation, guiding the generation of FOA waveforms from noise. Through extensive evaluations on two datasets, we demonstrate that our method consistently outperforms traditional simulation‑based baselines across both objective and subjective metrics.

Abstract:
Recently, diffusion models have achieved great success in mono‑channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large‑scale, simulation‑based, and GPT‑assisted dataset, BEWO‑1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for Latent Diffusion Models, we introduce the SpatialSonic model utilizing spatial‑aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our model not only achieves the objective of generating immersive and controllable spatial audio from text but also extends to other modalities as the pioneer attempt. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real‑world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules.

Abstract:
We introduce SRC‑gAudio, a novel audio generation model designed to facilitate text‑to‑audio generation across a wide range of sampling rates within a single model architecture. SRC‑gAudio incorporates the sampling rate as part of the generation condition to guide the diffusion‑based audio generation process. Our model enables the generation of audio at multiple sampling rates with a single unified model. Furthermore, we explore the potential benefits of large‑scale, low‑sampling‑rate data in enhancing the generation quality of high‑sampling‑rate audio. Through extensive experiments, we demonstrate that SRC‑gAudio effectively generates audio under controlled sampling rates. Additionally, our results indicate that pre‑training on low‑sampling‑rate data can lead to significant improvements in audio quality across various metrics.

Abstract:
Artificial Intelligence and generative models have revolutionized music creation, with many models leveraging textual or visual prompts for guidance. However, existing image‑to‑music models are limited to simple images, lacking the capability to generate music from complex digitized artworks. To address this gap, we introduce Art2Mus, a novel model designed to create music from digitized artworks or text inputs. Art2Mus extends the AudioLDM 2 architecture, a text‑to‑audio model, and employs our newly curated datasets, created via ImageBind, which pair digitized artworks with music. Experimental results demonstrate that Art2Mus can generate music that resonates with the input stimuli. These findings suggest promising applications in multimedia art, interactive installations, and AI‑driven creative tools.

Abstract:
We introduce a novel, general‑purpose audio generation framework specifically designed for anomaly detection and localization. Unlike existing datasets that predominantly focus on industrial and machine‑related sounds, our framework focuses on a broader range of environments, particularly useful in real‑world scenarios where only audio data are available, such as in video‑derived or telephonic audio. To generate such data, we propose a new method inspired by the LLM‑Modulo framework, which leverages large language models (LLMs) as world models to simulate such real‑world scenarios. This tool is modular, allowing a plug‑and‑play approach. It operates by first using LLMs to predict plausible real‑world scenarios. An LLM further extracts the constituent sounds, the order and the way in which these should be merged to create coherent wholes. Much like the LLM‑Modulo framework, we include rigorous verification of each output stage, ensuring the reliability of the generated data. The data produced using the framework serves as a benchmark for anomaly detection applications, potentially enhancing the performance of models trained on audio data, particularly in handling out‑of‑distribution cases. Our contributions thus fill a critical void in audio anomaly detection resources and provide a scalable tool for generating diverse, realistic audio data.

Abstract:
As generative artificial intelligence (AI) continues its ballistic trajectory, everything from text to audio, image, and video generation continues to improve at mimicking human‑generated content. Through a series of perceptual studies, we report on the realism of AI‑generated voices in terms of identity matching and naturalness. We find human participants cannot consistently identify recordings of AI‑generated voices. Specifically, participants perceived the identity of an AI‑voice to be the same as its real counterpart approximately 80% of the time, and correctly identified a voice as AI generated only about 60% of the time.

Abstract:
We introduce Audio‑Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text‑to‑audio (TTA) tasks often make single‑pass inferences from text descriptions. While straightforward, this design struggles to produce high‑quality audio when given complex text conditions. In our method, we utilize a pre‑trained TTA diffusion network as the audio generation agent to work in tandem with GPT‑4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. In doing so, Audio‑Agent can generate high‑quality audio that is closely aligned with the provided text or video exhibiting complex and multiple events, while supporting variable‑length and variable‑volume generation. For video‑to‑audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio, a process that can be tedious and time‑consuming. Instead, we propose a simpler approach by fine‑tuning a pre‑trained Large Language Model (LLM), e.g., Gemma2‑2B‑it, to obtain both semantic and temporal conditions that bridge the video and audio modality. Consequently, our framework contributes a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.

Abstract:
We introduce MDSGen, a novel framework for vision‑guided open‑domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal‑aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource‑heavy Unet‑based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre‑trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172× fewer parameters, 371% less memory, and offering 36× faster inference than the current 860M‑parameter state‑of‑the‑art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5× fewer parameters. These results highlight the scalability and effectiveness of our approach. The code is available at https://bit.ly/mdsgen.
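
The parameter ratios quoted in the abstract above follow from the stated model sizes. The short check below uses only those three numbers; the variable names are illustrative.

```python
# Parameter counts in millions, as stated in the abstract.
baseline_params = 860
small_params = 5
large_params = 131

# Reduction factors relative to the 860M-parameter baseline.
small_ratio = baseline_params / small_params
large_ratio = baseline_params / large_params

print(f"5M model: {small_ratio:.0f}x fewer parameters")
print(f"131M model: {large_ratio:.2f}x fewer parameters")
```

The first ratio is exactly 172; the second is slightly above 6.5, consistent with the rounded figure in the abstract.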

Abstract:
Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation tasks with discrete audio token sequences. However, directly discretizing audio by neural audio codecs often results in sequences that fundamentally differ from text sequences. Unlike text, where text token sequences are deterministic, discrete audio tokens can exhibit significant variability based on contextual factors, while still producing perceptually identical audio segments. We refer to this phenomenon as Discrete Representation Inconsistency (DRI). This inconsistency can lead to a single audio segment being represented by multiple divergent sequences, which creates confusion in neural codec language models and results in omissions and repetitions during speech generation. In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec. Our approach effectively mitigates the DRI phenomenon of the neural audio codec. Furthermore, extensive experiments on the neural codec language model over LibriTTS and large‑scale MLS datasets (44,000 hours) demonstrate the effectiveness and generality of our method. The demo of audio samples is available online at https://consistencyinneuralcodec.github.io.

Abstract:
Video encompasses both visual and auditory data, creating a perceptually rich experience where these two modalities complement each other. As such, videos are a valuable type of media for the investigation of the interplay between audio and visual elements. Previous studies of audio‑visual modalities primarily focused on either audio‑visual representation learning or generative modeling of a modality conditioned on the other, creating a disconnect between these two branches. A unified framework that learns representation and generates modalities has not been developed yet. In this work, we introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio‑visual representation learning and vision‑to‑audio generation. The key approach of VAB is that rather than working with raw video frames and audio data, VAB performs representation learning and generative modeling within latent spaces. In particular, VAB uses a pre‑trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively. It then performs the pre‑training task of visual‑conditioned masked audio token prediction. This training strategy enables the model to engage in contextual learning and simultaneous video‑to‑audio generation. After the pre‑training phase, VAB employs the iterative‑decoding approach to rapidly generate audio tokens conditioned on visual features. Since VAB is a unified model, its backbone can be fine‑tuned for various audio‑visual downstream tasks. Our experiments showcase the efficiency of VAB in producing high‑quality audio from video, and its capability to acquire semantic audio‑visual features, leading to competitive results in audio‑visual retrieval and classification.

Abstract:
Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open‑source platform ESPnet‑Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet‑Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet‑Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet‑Codec can be integrated into six ESPnet tasks, supporting diverse applications.

Abstract:
Video‑to‑audio (V2A) generation is important for video editing and post‑processing, enabling the creation of semantics‑aligned audio for silent video. However, most existing methods focus on generating short‑form audio for short video segments (less than 10 seconds), while giving little attention to the scenario of long‑form video inputs. For current UNet‑based diffusion V2A models, an inevitable problem when handling long‑form audio generation is the inconsistencies within the final concatenated audio. In this paper, we first highlight the importance of the long‑form V2A problem. We then propose LoVA, a novel model for Long‑form Video‑to‑Audio generation. Based on the Diffusion Transformer (DiT) architecture, LoVA proves to be more effective at generating long‑form audio compared to existing autoregressive models and UNet‑based diffusion models. Extensive objective and subjective experiments demonstrate that LoVA achieves comparable performance on a 10‑second V2A benchmark and outperforms all other baselines on a benchmark with long‑form video input.

Abstract:
With recent advances in AIGC, video generation has gained a surge of research interest in both academia and industry (e.g., Sora). However, it remains a challenge to produce temporally aligned audio to synchronize with the generated video, considering the complicated semantic information included in the latter. In this work, inspired by the recent success of text‑to‑audio (TTA) generation, we first investigate the video‑to‑audio (VTA) generation framework based on the latent diffusion model (LDM). Similar to the latest pioneering explorations in VTA, our preliminary results also show the great potential of LDM in the VTA task, but it still suffers from sub‑optimal temporal alignment. To this end, we propose to enhance the temporal alignment of VTA with frame‑level semantic information. With the recently popular grounding segment anything model (Grounding SAM), we can extract the fine‑grained semantics in video frames to enable VTA to produce better‑aligned audio signals. Extensive experiments demonstrate the effectiveness of our system on both objective and subjective evaluation metrics, showing both better audio quality and fine‑grained temporal alignment.

Abstract:
Denoising diffusion models have emerged as state‑of‑the‑art in generative tasks across image, audio, and video domains, producing high‑quality, diverse, and contextually relevant data. However, their broader adoption is limited by high computational costs and large memory footprints. Post‑training quantization (PTQ) offers a promising approach to mitigate these challenges by reducing model complexity through low‑bandwidth parameters. Yet, direct application of PTQ to diffusion models can degrade synthesis quality due to accumulated quantization noise across multiple denoising steps, particularly in conditional tasks like text‑to‑audio synthesis. This work introduces PTQ4ADM, a novel framework for quantizing audio diffusion models (ADMs). Our key contributions include (1) a coverage‑driven prompt augmentation method and (2) an activation‑aware calibration set generation algorithm for text‑conditional ADMs. These techniques ensure comprehensive coverage of audio aspects and modalities while preserving synthesis fidelity. We validate our approach on TANGO, Make‑An‑Audio, and AudioLDM models for text‑conditional audio generation. Extensive experiments demonstrate PTQ4ADM's capability to reduce the model size by up to 70% while achieving synthesis quality metrics comparable to full‑precision models (<5% increase in FD scores). We show that specific layers in the backbone network can be quantized to 4‑bit weights and 8‑bit activations without significant quality loss. This work paves the way for more efficient deployment of ADMs in resource‑constrained environments.

Abstract:
We introduce V‑AURA, the first autoregressive model to achieve high temporal alignment and relevance in video‑to‑audio generation. V‑AURA uses a high‑framerate visual feature extractor and a cross‑modal audio‑visual feature fusion strategy to capture fine‑grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio‑visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in‑the‑wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V‑AURA outperforms current state‑of‑the‑art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at https://v-aura.notion.site

Abstract:
Built upon vector quantization (VQ), discrete audio codec models have achieved great success in audio compression and auto‑regressive audio generation. However, existing models face substantial challenges in perceptual quality and signal distortion, especially when operating in extremely low bandwidth, rooted in the sensitivity of the VQ codebook to noise. This degradation poses significant challenges for several downstream tasks, such as codec‑based speech synthesis. To address this issue, we propose a novel VQ method, Normal Distribution‑based Vector Quantization (NDVQ), by introducing an explicit margin between the VQ codes via learning a variance. Specifically, our approach involves mapping the waveform to a latent space and quantizing it by selecting the most likely normal distribution, with each codebook entry representing a unique normal distribution defined by its mean and variance. Using these distribution‑based VQ codec codes, a decoder reconstructs the input waveform. NDVQ is trained with additional distribution‑related losses, alongside reconstruction and discrimination losses. Experiments demonstrate that NDVQ outperforms existing audio compression baselines, such as EnCodec, in terms of audio quality and zero‑shot TTS, particularly in very low bandwidth scenarios.
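
The selection step described above, choosing the most likely normal distribution rather than the nearest code vector, can be sketched as follows. This is a minimal illustration assuming diagonal Gaussians; the function names, the toy codebook, and the latent `z` are hypothetical, and the paper's actual training losses and codec architecture are not reproduced here.

```python
import math

def gaussian_log_likelihood(z, mean, var):
    """Log density of z under a normal distribution with diagonal variance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )

def select_code(z, means, variances):
    """Index of the codebook entry whose distribution makes z most likely."""
    scores = [gaussian_log_likelihood(z, m, v) for m, v in zip(means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)

def nearest_mean(z, means):
    """Index chosen by plain nearest-neighbor VQ, for comparison."""
    dists = [sum((x - m) ** 2 for x, m in zip(z, mean)) for mean in means]
    return min(range(len(dists)), key=dists.__getitem__)

# Toy codebook (hypothetical): entry 0 is narrow, entry 1 is broad.
means = [[0.0, 0.0], [3.0, 0.0]]
variances = [[0.1, 0.1], [4.0, 4.0]]
z = [1.3, 0.0]

print("nearest mean:", nearest_mean(z, means))
print("most likely distribution:", select_code(z, means, variances))
```

In this toy case the latent is closer to the mean of entry 0 but is assigned to entry 1, because the learned variance widens the region that entry 1 covers. This is one way a variance term can change the decision boundaries between codes.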

Abstract:
Current Text‑to‑audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine‑grained control of content and style. Some studies try to improve the granularity by incorporating additional frame‑level conditions or control networks. However, this usually leads to complex system design and difficulties due to the requirement for reference frame‑level conditions. To address these challenges, we propose AudioComposer, a novel TTA generation framework that relies solely on natural language descriptions (NLDs) to provide both content specification and style control information. To further enhance audio generative modeling, we employ flow‑based diffusion transformers with the cross‑attention mechanism to incorporate text descriptions effectively into audio generation processes, which can not only simultaneously consider the content and style information in the text inputs, but also accelerate generation compared to other architectures. Furthermore, we propose a novel and comprehensive automatic data simulation pipeline to construct data with fine‑grained text descriptions, which significantly alleviates the problem of data scarcity in the area. Experiments demonstrate the effectiveness of our framework using solely NLDs as inputs for content specification and style control. The generation quality and controllability surpass state‑of‑the‑art TTA models, even with a smaller model size.

Abstract:
Music‑text multimodal systems have enabled new approaches to Music Information Research (MIR) applications such as audio‑to‑text and text‑to‑audio retrieval, text‑based song generation, and music captioning. Despite the reported success, little effort has been put into evaluating the musical knowledge of Large Language Models (LLMs). In this paper, we demonstrate that LLMs suffer from 1) prompt sensitivity, 2) inability to model negation (e.g. 'rock song without guitar'), and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet‑based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. We leveraged the AudioSet ontology to generate triplets consisting of an anchor, a positive (relevant) label, and a negative (less relevant) label for the genre and instrument sub‑trees. We evaluated the triplet‑based musical knowledge for six general‑purpose Transformer‑based models. The triplets obtained through this methodology required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off‑the‑shelf LLMs need adaptation to music before use.
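
A triplet‑based accuracy of the kind described above can be sketched as the fraction of triplets in which the anchor is more similar to the positive label than to the negative one. The sketch below assumes cosine similarity over label embeddings; the labels, the three‑dimensional vectors, and the function names are hypothetical and are not taken from the paper or from the AudioSet ontology.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_accuracy(embeddings, triplets):
    """Share of (anchor, positive, negative) triplets ranked correctly."""
    correct = 0
    for anchor, positive, negative in triplets:
        sim_pos = cosine(embeddings[anchor], embeddings[positive])
        sim_neg = cosine(embeddings[anchor], embeddings[negative])
        correct += sim_pos > sim_neg
    return correct / len(triplets)

# Hypothetical label embeddings, standing in for a text encoder's output.
toy_embeddings = {
    "rock": [1.0, 0.1, 0.0],
    "punk rock": [0.9, 0.2, 0.0],
    "jazz": [0.1, 1.0, 0.0],
    "violin": [0.0, 0.2, 1.0],
    "cello": [0.1, 0.1, 0.9],
}
toy_triplets = [
    ("rock", "punk rock", "jazz"),
    ("violin", "cello", "rock"),
    ("jazz", "violin", "rock"),
]

accuracy = triplet_accuracy(toy_embeddings, toy_triplets)
print(f"triplet accuracy: {accuracy:.3f}")
```

The third toy triplet is deliberately ambiguous and is ranked incorrectly by these vectors, which mirrors the abstract's remark that some generated triplets are hard to judge and need filtering.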

Abstract:
Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute), and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State‑of‑the‑art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non‑spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non‑spatial audio, spatial audio, and open vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open‑source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non‑spatial audio, and the semantics and spatial attributes of spatial audio using contrastive learning. ELSA is competitive with state‑of‑the‑art for both semantic retrieval and 3D source localization. In particular, ELSA achieves +2.8% mean audio‑to‑text and text‑to‑audio R@1 above the baseline, and reduces mean absolute error in 3D source localization by 11.6° relative to the baseline.
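
For readers unfamiliar with the R@1 figure quoted above, the following sketch computes recall at k from a similarity matrix and averages the two retrieval directions. It assumes a simplified one‑to‑one pairing in which audio clip i matches caption i; the toy scores and function names are hypothetical, and real benchmarks may pair several captions with each clip.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumes one correct candidate per query, located at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(similarity)

def transpose(matrix):
    """Swap queries and candidates to evaluate the reverse direction."""
    return [list(col) for col in zip(*matrix)]

# Hypothetical audio-to-text similarity scores for three clip/caption pairs.
audio_to_text = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
text_to_audio = transpose(audio_to_text)

a2t_r1 = recall_at_k(audio_to_text, k=1)
t2a_r1 = recall_at_k(text_to_audio, k=1)
mean_r1 = (a2t_r1 + t2a_r1) / 2
print(f"A2T R@1={a2t_r1:.3f}  T2A R@1={t2a_r1:.3f}  mean={mean_r1:.3f}")
```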

Abstract:
Video‑to‑audio (V2A) generation aims to produce corresponding audio given silent video inputs. This task is particularly challenging due to the cross‑modality and sequential nature of the audio‑visual features involved. Recent works have made significant progress in bridging the domain gap between video and audio, generating audio that is semantically aligned with the video content. However, a critical limitation of these approaches is their inability to effectively recognize and handle multiple scenes within a video, often leading to suboptimal audio generation in such cases. In this paper, we first reimplement a state‑of‑the‑art V2A model with a slightly modified light‑weight architecture, achieving results that outperform the baseline. We then propose an improved V2A model that incorporates a scene detector to address the challenge of switching between multiple visual scenes. Results on VGGSound show that our model can recognize and handle multiple scenes within a video and achieve superior performance against the baseline for both fidelity and relevance.

Abstract:
Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi‑style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter is proposed. Unlike traditional static global style transfer, this method extracts a style embedding through cross‑attention between text and reference audio for adaptive style control. Adaptive layer normalization is then utilized to enhance the model's capacity to express multiple styles. Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target style audio generation task, enabling dual‑prompt audio generation using both text and audio references. Experimental results demonstrate the robustness of the model, achieving a state‑of‑the‑art Fréchet Distance of 26.94 and KL Divergence of 1.82, surpassing Tango, AudioLDM, and AudioGen. Furthermore, the generated audio shows high similarity to its corresponding audio reference. The demo, code, and dataset are publicly available.

Abstract:
In this paper, we introduce SoloAudio, a novel diffusion‑based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U‑Net backbone with a skip‑connected Transformer that operates on latent features. SoloAudio supports both audio‑oriented and language‑oriented TSE by utilizing a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state‑of‑the‑art text‑to‑audio models for training, demonstrating strong generalization to out‑of‑domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves state‑of‑the‑art results on both in‑domain and out‑of‑domain data and exhibits impressive zero‑shot and few‑shot capabilities. Source code and demos are released.

Abstract:
We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross‑attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi‑GAN is used to reconstruct the audio from the tokens. By applying a cross‑entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.

Abstract:
This paper proposes a novel framework for audio deepfake detection with two main objectives: i) attaining the highest possible accuracy on available fake data, and ii) effectively performing continuous learning on new fake data in a few‑shot learning manner. Specifically, we build a large audio deepfake collection using various deep audio generation methods. The data is further enhanced with additional augmentation methods to increase variation across compression, far‑field recordings, noise, and other distortions. We then adopt the Audio Spectrogram Transformer for the audio deepfake detection model. The proposed method achieves promising performance on various benchmark datasets. Furthermore, we present a continuous learning plugin module that updates the trained model using as few labeled data points of the new fake type as possible. The proposed method outperforms the conventional direct fine‑tuning approach with far fewer labeled data points.

Abstract:
This paper introduces MetaBGM, a groundbreaking framework for generating background music that adapts to dynamic scenes and real‑time user interactions. We define multi‑scene as variations in environmental contexts, such as transitions in game settings or movie scenes. To tackle the challenge of converting backend data into music description texts for audio generation models, MetaBGM employs a novel two‑stage generation approach that transforms continuous scene and user state data into these texts, which are then fed into an audio generation model for real‑time soundtrack creation. Experimental results demonstrate that MetaBGM effectively generates contextually relevant and dynamic background music for interactive applications.

Abstract:
In recent years, artificial intelligence (AI) has made significant progress in the field of music generation, driving innovation in music creation and applications. This paper provides a systematic review of the latest research advancements in AI music generation, covering key technologies, models, datasets, evaluation methods, and their practical applications across various fields. The main contributions of this review include: (1) presenting a comprehensive summary framework that systematically categorizes and compares different technological approaches, including symbolic generation, audio generation, and hybrid models, helping readers better understand the full spectrum of technologies in the field; (2) offering an extensive survey of current literature, covering emerging topics such as multimodal datasets and emotion expression evaluation, providing a broad reference for related research; (3) conducting a detailed analysis of the practical impact of AI music generation in various application domains, particularly in real‑time interaction and interdisciplinary applications, offering new perspectives and insights; (4) summarizing the existing challenges and limitations of music quality evaluation methods and proposing potential future research directions, aiming to promote the standardization and broader adoption of evaluation techniques. Through these innovative summaries and analyses, this paper serves as a comprehensive reference tool for researchers and practitioners in AI music generation, while also outlining future directions for the field.

Abstract:
Recent advancements in machine learning have fueled research on multimodal tasks, such as text‑to‑video and text‑to‑audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text‑to‑audio retrieval. In particular, we dissect the temporal understanding capabilities of a state‑of‑the‑art model for text‑to‑audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text‑audio dataset that provides a controlled setting for evaluating temporal capabilities of recent models. Lastly, we present a loss function that encourages text‑audio models to focus on the temporal ordering of events. Code and data are available at https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/.