arXiv Papers on Audio/Speech/Music Editing
Authors: Xun Gong, Jinchuan Tian, Haoran Wang, William Chen, Shinji Watanabe, Yanmin Qian
Abstract: Current text‑guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper‑Edit to enable open‑ended audio editing via free‑form natural language instructions. We reformulate audio editing as a rich‑caption rewriting task by treating a rich caption as the semantic representation of an audio clip. The user request is translated into an edited caption, which then guides Bagpiper‑Edit to generate the target edited audio with the original audio as a contextual acoustic anchor. This unlocks the potential of free‑form editing and circumvents the need for paired audio‑editing training data, enabling powerful zero‑shot editing capabilities. Evaluations across speech, audio, and free‑form editing show that Bagpiper‑Edit maintains good consistency with the original audio and achieves performance similar to other expert models in most cases. Demo: https://bagpiper‑edit.github.io, Code: https://github.com/espnet/espnet/pull/6417 & https://github.com/HsunGong/espnet
Authors: Dmitrii Gavrilev
Abstract: Expressive performance rendering (EPR) aims to generate realistic performances constrained on sequences of notes. However, flow matching audio editing models manipulate only synchronized music samples of the same duration, limiting their understanding of expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable‑length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and employ Dynamic Time Warping (DTW) in the latent space to construct paired data for training. The aligned embeddings are concatenated in DiT blocks, allowing simple and effective learning of the dependencies between the score and the performances. Audio samples are available at our demo page: https://realfolkcode.github.io/pianokontext_demo/.
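The DTW alignment step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dtw_path`, the Euclidean frame distance, and the toy one-dimensional "latents" are assumptions chosen only to show how two sequences of different lengths are paired frame by frame.

```python
import math


def dtw_path(a, b):
    """Align two sequences of equal-dimension vectors with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    running from (0, 0) to (len(a) - 1, len(b) - 1).
    The Euclidean distance is an assumption made for illustration.
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")

    def dist(x, y):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + step

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy example: a "score" of 3 frames and a "performance" that holds the
# middle frame one step longer (hypothetical values).
score_latents = [[0.0], [1.0], [2.0]]
performance_latents = [[0.0], [1.0], [1.0], [2.0]]
total, alignment = dtw_path(score_latents, performance_latents)
```

Each pair in `alignment` says which score frame corresponds to which performance frame, which is the kind of pairing needed to build aligned training data from sequences of unequal duration.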
Authors: Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen
Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general‑purpose instruction‑based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano‑banana 2 for images and Gemini‑Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real‑world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi‑hop reasoning and multi‑round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human‑agent collaboration, MMAE comprises 2,000 high‑fidelity samples paired with a pioneering rubric‑based evaluation framework. By decomposing free‑form tasks into 17,741 verifiable criteria, this robust rubric‑based paradigm enables a precise, multi‑dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed‑modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long‑lasting evaluation paradigm for next‑generation audio editing systems.
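To make the rubric-based scoring concrete, here is a small sketch. The abstract does not define the Exact Match Rate, so the definition used below (a sample counts as an exact match only if every one of its rubric criteria passes) is an assumption, as are the function names and the toy data.

```python
def exact_match_rate(per_sample_criteria):
    """Fraction of samples whose rubric criteria ALL pass.

    per_sample_criteria: list of lists of booleans, one inner list per sample.
    Assumed definition; a sample with no criteria is not counted as a match.
    """
    if not per_sample_criteria:
        raise ValueError("need at least one sample")
    exact = sum(1 for crit in per_sample_criteria if crit and all(crit))
    return exact / len(per_sample_criteria)


def criterion_pass_rate(per_sample_criteria):
    """Fraction of individual criteria that pass, pooled over all samples."""
    flat = [c for crit in per_sample_criteria for c in crit]
    if not flat:
        raise ValueError("need at least one criterion")
    return sum(flat) / len(flat)


# Hypothetical judgments for three edited samples.
judgments = [[True, True], [True, False], [True]]
emr = exact_match_rate(judgments)
pass_rate = criterion_pass_rate(judgments)
```

Under this assumed definition the exact-match score is much stricter than the pooled pass rate, which would be consistent with a system satisfying many individual criteria while rarely producing a fully correct edit.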
Authors: Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yongchang Gan, Yong Qin
Abstract: Speech editing and zero‑shot Text‑to‑Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine‑Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse‑grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two‑stage post‑training framework that progresses from supervised editing initialization to editing‑oriented Group Relative Policy Optimization (GRPO) over target‑speech‑free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero‑shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
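The core idea of Group Relative Policy Optimization, scoring each sampled output relative to the other outputs drawn for the same input, can be sketched as follows. This is the generic group-normalized advantage, not CosyEdit2's editing-oriented variant: the reward values are hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is an assumption; implementations differ.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three candidate edits of the same utterance.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the baseline is the group mean, no separate value network is needed, and only a scalar reward per generated sample is required rather than a ground-truth target recording.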
Authors: Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Abstract: Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable‑length audio generation and editing. Since our models can generate several minutes of audio, variable‑length generations are key to avoiding the cost of producing full‑length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic‑acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion‑based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post‑training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer‑grade hardware, together with their training and inference pipeline.
Authors: Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lyu, Wei Xue, Yike Guo
Abstract: Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio‑Omni, the first end‑to‑end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi‑modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high‑level reasoning with a trainable Diffusion Transformer for high‑fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large‑scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio‑Omni achieves state‑of‑the‑art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio‑Omni exhibits remarkable inherited capabilities, including knowledge‑augmented reasoning generation, in‑context generation, and zero‑shot cross‑lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio‑Omni.
Authors: Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
Abstract: Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label‑based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio‑based methods can leverage emotionally rich speech signals ‑ and even benefit from expressive text‑to‑speech (TTS) synthesis ‑ but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image‑based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high‑quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross‑Modal Emotion Transfer (C‑MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between speech and visual feature spaces. C‑MET leverages a large‑scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA‑D datasets demonstrate that our method improves emotion accuracy by 14% over state‑of‑the‑art methods, while generating expressive talking face videos ‑ even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok‑choi.github.io/C‑MET/
Authors: Xiaobin Rong, Yushi Wang, Zheng Wang, Jing Lu
Abstract: We introduce GAP‑URGENet, a generative‑predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full‑stack speech restoration in a self‑supervised representation domain and reconstructs the waveform via a neural vocoder, along with a predictive branch that performs spectrogram‑domain enhancement, providing complementary cues. Outputs from both branches are fused by a post‑processing module, which also performs bandwidth extension to generate the enhanced waveform at 48 kHz, later downsampled to the original sampling rate. This generative‑predictive fusion improves robustness and perceptual quality, achieving top performance in the blind‑test phase and ranking 1st in the objective evaluation. Audio examples are available at https://xiaobin‑rong.github.io/gap‑urgenet_demo.
Authors: Ziqi Liang, Zhijun Jia, Chang Liu, Minghui Yang, Zhihong Lu, Jian Wang
Abstract: Previous speech restoration (SR) primarily focuses on single‑task speech restoration (SSR), which cannot address general speech restoration problems. Training specific SSR models for different distortions is time‑consuming and lacks generality. In addition, most studies ignore the problem of model generalization across unseen domains. To overcome these limitations, we propose DisSR, a Disentangling Speech Representation based general speech restoration model with two properties: 1) Degradation‑prior guidance, which extracts a speaker‑invariant degradation representation to guide the diffusion‑based speech restoration model. 2) Domain adaptation, where we design cross‑domain alignment training to enhance the model's adaptability and generalization on cross‑domain data. Experimental results demonstrate that our method can produce high‑quality restored speech under various distortion conditions. Audio samples can be found at https://itspsp.github.io/DisSR.
Authors: Hanchen Pei, Shujie Liu, Yanqing Liu, Jianwei Yu, Yuanhang Qian, Gongping Huang, Sheng Zhao, Yan Lu
Abstract: Neural codec language models achieve impressive zero‑shot Text‑to‑Speech (TTS) by fully imitating the acoustic characteristics of a short speech prompt, including timbre, prosody, and paralinguistic information. However, such holistic imitation limits their ability to isolate and control individual attributes. In this paper, we present a unified codec language model SpeechEdit that extends zero‑shot TTS with a selective control mechanism. By default, SpeechEdit reproduces the complete acoustic profile inferred from the speech prompt, but it selectively overrides only the attributes specified by explicit control instructions. To enable controllable modeling, SpeechEdit is trained on our newly constructed LibriEdit dataset, which provides delta (difference‑aware) training pairs derived from LibriHeavy. Experimental results show that our approach maintains naturalness and robustness while offering flexible and localized control over desired attributes. Audio samples are available at https://speech‑editing.github.io/speech‑editing/.
Authors: Zhiyuan Zhao, Lijian Lin, Ye Zhu, Kai Xie, Yunfei Liu, Yu Li
Abstract: We present the LEMAS‑Dataset, which, to our knowledge, is currently the largest open‑source multilingual speech corpus with word‑level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS‑Dataset is constructed via an efficient data processing pipeline that ensures high‑quality data and annotations. To validate the effectiveness of LEMAS‑Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS‑TTS, built upon a non‑autoregressive flow‑matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero‑shot multilingual synthesis. Our proposed accent‑adversarial training and CTC loss mitigate cross‑lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS‑Edit employs an autoregressive decoder‑only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word‑level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth‑boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS‑Dataset deliver high‑quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp‑annotated, fine‑grained multilingual corpus will drive future advances in prompt‑based speech generation systems.
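How word-level timestamps can be turned into a token-level infilling mask is sketched below. This is an illustration under stated assumptions rather than the LEMAS-Edit pipeline: the function name, the `(word, start_s, end_s)` tuple layout, and the fixed token rate are all hypothetical.

```python
import math


def word_spans_to_token_mask(words, edit_indices, num_tokens, tokens_per_second):
    """Mark the audio tokens covered by the words selected for editing.

    words: list of (word, start_s, end_s) tuples with word-level timestamps.
    edit_indices: indices into `words` for the words to be regenerated.
    Returns a list of booleans, True where a token should be masked.
    A constant token rate is an assumption made for illustration.
    """
    mask = [False] * num_tokens
    for idx in edit_indices:
        _, start_s, end_s = words[idx]
        # Round outward so the mask fully covers the word boundaries.
        first = max(0, math.floor(start_s * tokens_per_second))
        last = min(num_tokens, math.ceil(end_s * tokens_per_second))
        for t in range(first, last):
            mask[t] = True
    return mask


# Hypothetical one-second utterance tokenized at 10 tokens per second.
words = [("hello", 0.0, 0.5), ("world", 0.5, 1.0)]
mask = word_spans_to_token_mask(words, edit_indices=[1], num_tokens=10,
                                tokens_per_second=10)
```

The unmasked tokens act as context on both sides of the edited region, so the precision of the timestamps directly determines how cleanly the infilled span meets its neighbors.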
Authors: Core Team, Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xingchen Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu, Peidian Li, Qiying Wang, Sirui Deng, Weimin Xiong, Wenshan Huang, Wenyu Yang, Yilin Jiang, Yixin Yang, Yuanyuan Tian, Yue Ma, Yue Yu, Zihan Zhang, Zihao Yue, Bangjun Xiao, Bingquan Xia, Bofei Gao, Bowen Ye, Can Cai, Chang Liu, Chenhong He, Chunan Li, Dawei Zhu, Duo Zhang, Fengyuan Shi, Guoan Wang, Hailin Zhang, Hanglong Lv, Hanyu Li, Hao Tian, Heng Qu, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianguang Zuo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Linghao Zhang, Meng Chen, Nuo Chen, Peng Zhang, Qianli Chen, Qiantong Wang, Rang Li, Shaohui Liu, Shengfan Wang, Shicheng Li, Shihua Yu, Shijie Cao, Shimao Chen, Shuhao Gu, Weikun Wang, Wenhan Ma, Xiangwei Deng, Xing Yong, Xing Zhang, Xu Wang, Yifan Song, Yihao Zhao, Yingbo Zhao, Yizhao Gao, Yu Cheng, Yu Tu, Yudong Wang, Zhaojun Huang, Zhengju Tang, Zhenru Lin, Zhichao Song, Zhipeng Xu, Zhixian Zheng, Zihan Jiang
Abstract: Existing audio language models typically rely on task‑specific fine‑tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT‑3 has shown that scaling next‑token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo‑Audio's pretraining data to over one hundred million hours, we observe the emergence of few‑shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo‑Audio‑7B‑Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open‑source models. Beyond standard metrics, MiMo‑Audio‑7B‑Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo‑Audio‑7B‑Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post‑training stage, we curate a diverse instruction‑tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo‑Audio‑7B‑Instruct achieves open‑source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU‑Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct‑TTS evaluations, approaching or surpassing closed‑source models. Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo‑Audio.
Authors: Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
Abstract: We introduce a novel pipeline for joint audio‑visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state‑of‑the‑art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video‑to‑audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio‑visual alignment and content integrity.
Authors: Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Yinghao Liu, Zheng Xue, Gang Song, Boyang Zhou
Abstract: Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok‑Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok‑Audio extracts continuous features of the conditions to generate discrete tokens of the target audio in an autoregressive manner; 2) a special task identifier token unifies different learning patterns of multiple tasks in a single framework; 3) a dual‑stream audio codec involving acoustic and semantic branches is developed for high‑fidelity waveform reconstruction. Experimental results demonstrate that UniTok‑Audio achieves competitive performance in comparison with state‑of‑the‑art task‑specific or multi‑task systems across five time‑aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language‑queried audio source separation. To foster future research, we will open‑source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified‑audio.
Authors: Haoyin Yan, Chengwei Liu, Shaofei Xue, Xiaotao Liang, Zheng Xue
Abstract: The development of neural audio codecs (NACs) has largely promoted applications of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM‑based models in unifying different sub‑tasks of speech enhancement (SE) has not yet been verified. In this work, we propose UniSE, a unified decoder‑only LM‑based framework to handle different SE tasks including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling, which facilitates compatibility between distinct learning patterns of multiple tasks. Experiments on several benchmarks indicate the proposed UniSE can achieve competitive performance compared to discriminative and generative baselines, showing the capacity of LMs in unifying SE tasks. The demo page is available here: https://github.com/hyyan2k/UniSE.
Authors: Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen
Abstract: Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text‑to‑speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high‑quality generation. To mitigate the inherent divergence between autoregressive and flow‑matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text‑prefix‑conditioned speech infilling method enables high‑fidelity zero‑shot voice cloning. Experimental results demonstrate that our method can match or exceed current single‑task modeling methods in both ASR and zero‑shot TTS tasks. This work explores new possibilities for end‑to‑end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.
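The difference between the two attention masks named in this abstract can be shown with a short sketch. It only illustrates the mask shapes; the function name and the mode strings are assumptions, and how UniVoice actually wires the masks into its model is not described here.

```python
def attention_mask(seq_len, mode):
    """Build a boolean attention mask; mask[i][j] is True if i may attend to j.

    mode "causal": position i sees positions 0..i (used for recognition).
    mode "bidirectional": every position sees every position (synthesis).
    The mode names are hypothetical labels chosen for this sketch.
    """
    if mode == "causal":
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    if mode == "bidirectional":
        return [[True] * seq_len for _ in range(seq_len)]
    raise ValueError(f"unknown mode: {mode!r}")


recognition_mask = attention_mask(3, "causal")
synthesis_mask = attention_mask(3, "bidirectional")
```

The causal mask is lower triangular, which suits token-by-token autoregressive decoding, while the all-True mask lets every frame condition on the full sequence, as flow-matching style generation typically requires.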
Authors: Jiaqi Li, Yao Qian, Yuxuan Hu, Leying Zhang, Xiaofei Wang, Heng Lu, Manthan Thakker, Jinyu Li, Sheng Zhao, Zhizheng Wu
Abstract: Neural audio codecs are foundational to speech language models. They are expected to have a low frame rate and decoupled semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low‑frame‑rate audio codecs, but even lower frame rate codecs remain underexplored. We find that a major challenge for very low frame rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring an ASR feature‑assisted dual stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information‑sparse regions by adaptively merging semantically similar frames. A dynamic frame rate also allows FlexiCodec to support inference‑time controllable frame rates between 3Hz and 12.5Hz. Experiments on 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec excels over baseline systems in semantic information preservation and delivers a high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model‑based TTS. Demos are available at: https://flexicodec.github.io. Code is available at: https://github.com/amphionteam/flexicodec.
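Merging semantically similar adjacent frames, and the effect this has on the average frame rate, can be sketched as below. This is a simplified greedy scheme, not FlexiCodec's actual merging rule: the cosine-similarity criterion against the previous frame, the threshold value, and all names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_similar_frames(frames, threshold):
    """Greedily merge each frame into the current segment when it is
    similar enough to the previous frame.

    Returns a list of (mean_vector, num_frames_merged) per segment.
    """
    segments = []  # each entry: [running_sum_vector, count]
    prev = None
    for frame in frames:
        if segments and cosine(prev, frame) >= threshold:
            seg = segments[-1]
            seg[0] = [s + x for s, x in zip(seg[0], frame)]
            seg[1] += 1
        else:
            segments.append([list(frame), 1])
        prev = frame
    return [([s / n for s in total], n) for total, n in segments]


def average_frame_rate(base_rate_hz, num_frames, num_segments):
    """Average token rate after merging num_frames into num_segments."""
    return base_rate_hz * num_segments / num_frames


# Four hypothetical 12.5 Hz frames that collapse into two segments.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
merged = merge_similar_frames(frames, threshold=0.9)
rate = average_frame_rate(12.5, len(frames), len(merged))
```

In this toy case, halving the number of segments halves the average rate from 12.5 Hz to 6.25 Hz, and raising or lowering the threshold is one plausible way to control the frame rate at inference time.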
Authors: Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun Zhu
Abstract: Bridge models have been investigated in speech enhancement but are mostly single‑task, with constrained general speech restoration (GSR) capability. In this work, we propose VoiceBridge, a one‑step latent bridge model (LBM) for GSR, capable of efficiently reconstructing 48 kHz fullband speech from diverse distortions. To inherit the advantages of data‑domain bridge models, we design an energy‑preserving variational autoencoder, enhancing the waveform‑latent space alignment over varying energy levels. By compressing waveform into continuous latent representations, VoiceBridge models various GSR tasks with a single latent‑to‑latent generative process backed by a scalable transformer. To alleviate the challenge of reconstructing the high‑quality target from distinctively different low‑quality priors, we propose a joint neural prior for GSR, uniformly reducing the burden of the LBM in diverse tasks. Building upon these designs, we further investigate the bridge training objective by jointly tuning the LBM, decoder, and discriminator, transforming the model from a denoiser to a generator and enabling one‑step GSR without distillation. Extensive validation across in‑domain (e.g., denoising and super‑resolution) and out‑of‑domain tasks (e.g., refining synthesized speech) and datasets demonstrates the superior performance of VoiceBridge. Demos: https://VoiceBridgedemo.github.io/.
Authors: Yushen Chen, Kai Hu, Long Zhou, Shulin Feng, Xusheng Yang, Hangting Chen, Xie Chen
Abstract: We propose AUV, a unified neural audio codec with a single codebook, which enables a favourable reconstruction of speech and further extends to general audio, including vocal, music, and sound. AUV is capable of tackling any 16 kHz mixed‑domain audio segment at bit rates around 700 bps. To accomplish this, we guide the matryoshka codebook with nested domain‑specific partitions, assigned with corresponding teacher models to perform distillation, all in a single‑stage training. A conformer‑style encoder‑decoder architecture with STFT features as audio representation is employed, yielding better audio quality. Comprehensive evaluations demonstrate that AUV exhibits comparable audio reconstruction ability to state‑of‑the‑art domain‑specific single‑layer quantizer codecs, showcasing the potential of audio universal vector quantization with a single codebook. The pre‑trained model and demo samples are available at https://swivid.github.io/AUV/.
Authors: Zitong Lan, Yiduo Hao, Mingmin Zhao
Abstract: Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media. However, recent generative audio editing models depend on template‑like instruction formats and are restricted to mono‑channel audio. These models fail to deal with declarative audio editing, where the user declares what the desired outcome should be, while leaving the details of editing operations to the system. We introduce SmartDJ, a novel framework for stereo audio editing that combines the reasoning capability of audio language models with the generative power of latent diffusion. Given a high‑level instruction, SmartDJ decomposes it into a sequence of atomic edit operations, such as adding, removing, or spatially relocating events. These operations are then executed by a diffusion model trained to manipulate stereo audio. To support this, we design a data synthesis pipeline that produces paired examples of high‑level instructions, atomic edit operations, and audio before and after each edit operation. Experiments demonstrate that SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods. Demos are available at https://zitonglan.github.io/project/smartdj/smartdj.html.
Authors: Yuhang Jia, Xu Zhang, Yang Chen, Hui Wang, Enzhi Wang, Yong Qin
Abstract: Automatic mean opinion score (MOS) prediction provides a more perceptual alternative to objective metrics, offering deeper insights into the evaluated models. With the rapid progress of multimodal large language models (MLLMs), their enhanced perceptual and reasoning abilities enable more comprehensive and interpretable audio quality assessment. In this work, we tackle the challenging task of audio editing evaluation and propose the first natural language‑based automated evaluation framework built on MLLMs. Our approach introduces two fine‑tuning tasks to boost multi‑audio understanding, combined with Chain‑of‑Thought prompting and lightweight instruction tuning to enhance step‑by‑step reasoning. Experiments demonstrate that our framework delivers accurate, interpretable, and text‑based editing evaluation, closely aligning with human judgments and objective metrics while substantially improving over baselines. The code and demo are available at https://github.com/NKU‑HLT/Eval_Reasoning.
Authors: Sean Turland, Eloi Moliner, Vesa Välimäki
Abstract: Music inpainting aims to reconstruct missing segments of a corrupted recording. While diffusion‑based generative models improve reconstruction for medium‑length gaps, they often struggle to preserve musical plausibility over multi‑second gaps. We introduce Similarity‑Guided Diffusion Posterior Sampling (SimDPS), a hybrid method that combines diffusion‑based inference with similarity search. Candidate segments are first retrieved from a corpus based on contextual similarity, then incorporated into a modified likelihood that guides the diffusion process toward contextually consistent reconstructions. Subjective evaluation on piano music inpainting with 2‑s gaps shows that the proposed SimDPS method enhances perceptual plausibility compared to unguided diffusion and frequently outperforms similarity search alone when moderately similar candidates are available. These results demonstrate the potential of a hybrid similarity approach for diffusion‑based audio enhancement with long gaps.
Authors: Xueping Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li
Abstract: Component‑level audio Spoofing (Comp‑Spoof) targets a new form of audio manipulation where only specific components of a signal, such as speech or environmental sound, are forged or substituted while other components remain genuine. Existing anti‑spoofing datasets and methods treat an utterance or a segment as entirely bona fide or entirely spoofed, and thus cannot accurately detect component‑level spoofing. To address this, we construct a new dataset, CompSpoof, covering multiple combinations of bona fide and spoofed speech and environmental sound. We further propose a separation‑enhanced joint learning framework that separates the audio components and applies anti‑spoofing models to each one. Joint learning is employed, preserving information relevant for detection. Extensive experiments demonstrate that our method outperforms the baseline, highlighting the necessity of separating components and the importance of detecting spoofing for each component separately. Datasets and code are available at: https://github.com/XuepingZhang/CompSpoof.
Authors: Junan Zhang, Mengyao Zhu, Xin Xu, Hui Bu, Zhenhua Ling, Zhizheng Wu
Abstract: Real‑world speech communication is rarely affected by a single type of degradation. Instead, it suffers from a complex interplay of acoustic interference, codec compression, and, increasingly, secondary artifacts introduced by upstream enhancement algorithms. To bridge the gap between academic research and these realistic scenarios, we introduced the CCF AATC 2025 Challenge. This challenge targets universal blind speech restoration, requiring a single model to handle three distinct distortion categories: acoustic degradation, codec distortion, and secondary processing artifacts. In this paper, we provide a comprehensive retrospective of the challenge, detailing the dataset construction, task design, and a systematic analysis of the 25 participating systems. We report three key findings that define the current state of the field: (1) Efficiency vs. Scale: Contrary to the trend of massive generative models, top‑performing systems demonstrated that lightweight discriminative architectures (<10M parameters) can achieve state‑of‑the‑art performance, balancing restoration quality with deployment constraints. (2) Generative Trade‑off: While generative and hybrid models excel in theoretical perceptual metrics, breakdown analysis reveals they suffer from "reconstruction bias" in high‑SNR codec tasks and struggle with hallucination in complex secondary artifact scenarios. (3) Metric Gap: Most critically, our rank correlation analysis exposes a strong negative correlation (ρ = −0.8) between widely‑used reference‑free metrics (e.g., DNSMOS) and human MOS when evaluating hybrid systems. This indicates that current metrics may over‑reward artificial spectral smoothness at the expense of perceptual naturalness. This paper aims to serve as a reference for future research in robust speech restoration and calls for the development of next‑generation evaluation metrics sensitive to generative artifacts.
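A rank correlation of the kind reported in this abstract can be computed as sketched below. The abstract does not say which rank statistic was used, so Spearman's rho is an assumption here, and the scores in the example are made-up placeholders rather than challenge results.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for five systems (illustrative only, not real data):
# the automatic metric orders the systems exactly opposite to the listeners.
metric_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
human_mos = [5.0, 4.0, 3.0, 2.0, 1.0]
rho = spearman_rho(metric_scores, human_mos)
```

A value near -1 means the metric ranks systems in roughly the reverse order of human listeners, which is why a strongly negative rho is a warning sign for a quality metric.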
Authors: Christian Zhou-Zheng, John Backsund, Dun Li Chan, Alex Coventry, Avid Eslami, Jyotin Goel, Xingwen Han, Danysh Soomro, Galen Wei
Abstract: We present a traditional approach to symbolic piano music continuation for the MIREX 2025 Symbolic Music Generation challenge. While computational music generation has recently focused on developing large foundation models with sophisticated architectural modifications, we argue that simpler approaches remain more effective for constrained, single‑instrument tasks. We thus return to a simple, unaugmented next‑token‑prediction objective on tokenized raw MIDI, aiming to outperform large foundation models by using better data and better fundamentals. We release model weights and code at https://github.com/christianazinn/mirex2025.
Authors: Yuhang Jia, Hui Wang, Xin Nie, Yujie Guo, Lianru Gao, Yong Qin
Abstract: Audio editing aims to manipulate audio content based on textual descriptions, supporting tasks such as adding, removing, or replacing audio events. Despite recent progress, the lack of high‑quality benchmark datasets and comprehensive evaluation metrics remains a major challenge for both assessing audio editing quality and improving the task itself. In this work, we propose a novel approach to the audio editing task that incorporates expert knowledge into both the evaluation and dataset construction processes: 1) First, we establish AuditScore, the first comprehensive dataset for subjective evaluation of audio editing, consisting of over 6,300 edited samples generated from 7 representative audio editing frameworks and 23 system configurations. Each sample is annotated by professional raters on three key aspects of audio editing quality: overall Quality, Relevance to editing intent, and Faithfulness to original features. 2) Based on this dataset, we propose AuditEval, a family of automatic MOS‑style evaluators tailored for audio editing, covering both SSL‑based and LLM‑based approaches. It addresses the lack of effective objective metrics and the prohibitive cost of subjective evaluation in this field. 3) We further leverage AuditEval to evaluate and filter a large amount of synthetically mixed editing pairs, mining a high‑quality pseudo‑parallel subset by selecting the most plausible samples. Comprehensive experiments validate that our expert‑informed filtering strategy effectively yields higher‑quality data, while also exposing the limitations of traditional objective metrics and the advantages of AuditEval. The dataset, code, and tools can be found at: https://github.com/NKU‑HLT/AuditEval.
Authors: Mordehay Moradi, Sharon Gannot
Abstract: In this paper, we present PGDI, a diffusion‑based speech inpainting framework for restoring missing or severely corrupted speech segments. Unlike previous methods that struggle with speaker variability or long gap lengths, PGDI can accurately reconstruct gaps of up to one second in length while preserving speaker identity, prosody, and environmental factors such as reverberation. Central to this approach is classifier guidance, specifically phoneme‑level guidance, which substantially improves reconstruction fidelity. PGDI operates in a speaker‑independent manner and remains robust even when long segments are completely masked by strong transient noise, such as fireworks, door slams, hammer strikes, and construction noise, making it well‑suited for real‑world applications. Through extensive experiments across diverse speakers and gap lengths, we demonstrate PGDI's superior inpainting performance and its ability to handle challenging acoustic conditions. We consider both scenarios, with and without access to the transcript during inference, showing that while the availability of text further enhances performance, the model remains effective even in its absence. For audio samples, visit: https://mordehaym.github.io/PGDI/
Authors: Vassilis Sioros, Alexandros Potamianos, Giorgos Paraskevopoulos
Abstract: In this study, we investigate leveraging cross‑attention control for efficient audio editing within auto‑regressive models. Inspired by image editing methodologies, we develop a Prompt‑to‑Prompt‑like approach that guides edits through cross and self‑attention mechanisms. Integrating a diffusion‑based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt‑guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre‑trained frozen auto‑regressive model, and propose three editing mechanisms, based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly‑used music‑specific evaluation metrics and a human study, to gauge time‑varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt‑to‑prompt guidance with autoregressive generation models significantly outperforms the diffusion‑based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen
Authors: Alexander Fichtinger, Jan Schlüter, Gerhard Widmer
Abstract: Generative models of music audio are typically used to generate output based solely on a text prompt or melody. Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pretrained diffusion model. In this work, we explore its application in the audio domain as a tool for data augmentation or content manipulation. Specifically, implementing Boomerang sampling for Stable Audio Open, we augment training data for a state‑of‑the‑art beat tracker, and attempt to replace musical instruments in recordings. Our results show that the rhythmic structure of existing examples is mostly preserved, that it improves performance of the beat tracker, but only in scenarios of limited training data, and that it can accomplish text‑based instrument replacement on monophonic inputs. We publish our implementation to invite experiments on data augmentation in other tasks and explore further applications.
Authors: Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu
Abstract: Speech codecs serve as bridges between speech signals and large language models. An ideal codec for speech language models should not only preserve acoustic information but also capture rich semantic information. However, existing speech codecs struggle to balance high‑quality audio reconstruction with ease of modeling by language models. In this study, we analyze the limitations of previous codecs in balancing semantic richness and acoustic fidelity. We propose XY‑Tokenizer, a novel codec that mitigates the conflict between semantic and acoustic capabilities through multi‑stage, multi‑task learning. Experimental results demonstrate that XY‑Tokenizer achieves performance in both semantic and acoustic tasks comparable to that of state‑of‑the‑art codecs operating at similar bitrates, even though those existing codecs typically excel in only one aspect. Specifically, XY‑Tokenizer achieves strong text alignment, surpassing distillation‑based semantic modeling methods such as SpeechTokenizer and Mimi, while maintaining a speaker similarity score of 0.83 between reconstructed and original audio. The reconstruction performance of XY‑Tokenizer is comparable to that of BigCodec, the current state‑of‑the‑art among acoustic‑only codecs, which achieves a speaker similarity score of 0.84 at a similar bitrate. Code and models are available at https://github.com/gyt1145028706/XY‑Tokenizer.
Authors: Fang-Duo Tsai, Shih-Lun Wu, Weijaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang
Abstract: We propose MuseControlLite, a lightweight mechanism designed to fine‑tune text‑to‑music generation models for precise conditioning using various time‑varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have been seldom used by text‑to‑music generation models in the conditioner for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross‑attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state‑of‑the‑art fine‑tuning mechanisms, using the same pre‑trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen‑Large and Stable Audio Open ControlNet at a significantly lower fine‑tuning cost, with only 85M trainable parameters. Source code, model checkpoints, and demo examples are available at: https://musecontrollite.github.io/web/.
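Note: the role of rotary positional embeddings for time‑varying conditions can be illustrated with a minimal, generic sketch. It is not MuseControlLite's implementation: `rotary_embed` and `attention_score` are hypothetical helpers, the vectors are arbitrary, and the base of 10000 is the commonly used default rather than a value taken from the abstract. The sketch shows the property that makes rotary embeddings suitable here: after rotation, the query‑key score depends on the relative offset between the two time positions.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def attention_score(q, k, q_pos, k_pos):
    """Unnormalized dot-product score between a rotated query and key."""
    rq = rotary_embed(q, q_pos)
    rk = rotary_embed(k, k_pos)
    return sum(x * y for x, y in zip(rq, rk))


# Arbitrary illustrative vectors.
q = [0.3, -1.2, 0.5, 0.8]
k = [1.0, 0.4, -0.7, 0.2]
# Same offset (3 frames) at different absolute positions gives the same score.
print(attention_score(q, k, 0, 3), attention_score(q, k, 5, 8))
```

Without any positional signal, the score for a given query and key would be identical at every time offset, so a condition that varies over time (such as a melody contour) could not be aligned to specific frames.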
Authors: Christian Zhou-Zheng, Philippe Pasquier
Abstract: Existing work in automatic music generation has mostly focused on end‑to‑end systems that generate either entire compositions or continuations of pieces, which are difficult for composers to iterate on. The area of computer‑assisted composition, where generative models integrate into existing creative workflows, remains comparatively underexplored. In this study, we address the tasks of model style adaptation and multi‑track, long‑context, and controllable symbolic music infilling to enhance the process of computer‑assisted composition. We present MIDI‑RWKV, a small foundation model based on the RWKV‑7 linear architecture, to enable efficient and coherent musical cocreation on edge devices. We also demonstrate that MIDI‑RWKV admits an effective method of finetuning its initial state for style adaptation in the very‑low‑sample regime. We evaluate MIDI‑RWKV and its state tuning on several quantitative and qualitative metrics with respect to existing models, and release model weights and code at https://github.com/christianazinn/MIDI‑RWKV.
Authors: Rui Liu, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li
Abstract: Text‑based speech editing (TSE) modifies speech using only text, eliminating re‑recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistency issues introduced by text changes. To address this issue, we propose EmoCorrector, a novel post‑correction scheme for TSE. EmoCorrector leverages Retrieval‑Augmented Generation (RAG) by extracting the edited text's emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker's identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we introduce a benchmark, the Emotion Correction Dataset for TSE (ECD‑TSE). The prominent aspect of ECD‑TSE is its inclusion of <text, speech> paired data featuring diverse text variations and a range of emotional expressions. Subjective and objective experiments and comprehensive analysis on ECD‑TSE confirm that EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods. Code and audio examples are available at https://github.com/AI‑S2‑Lab/EmoCorrector.
Authors: Wenda Chu, Zihui Wu, Yifan Chen, Yang Song, Yisong Yue
Abstract: We study the problem of posterior sampling in discrete‑state spaces using discrete diffusion models. While posterior sampling methods for continuous diffusion models have achieved remarkable progress, analogous methods for discrete diffusion models remain challenging. In this work, we introduce a principled plug‑and‑play discrete diffusion posterior sampling algorithm based on split Gibbs sampling, which we call SGDD. Our algorithm enables reward‑guided generation and solving inverse problems in discrete‑state spaces. We demonstrate the convergence of SGDD to the target posterior distribution and verify this through controlled experiments on synthetic benchmarks. Our method enjoys state‑of‑the‑art posterior sampling performance on a range of benchmarks for discrete data, including DNA sequence design, discrete image inverse problems, and music infilling, achieving more than 30% improved performance compared to existing baselines. Our code is available at https://github.com/chuwd19/Split‑Gibbs‑Discrete‑Diffusion‑Posterior‑Sampling.
Authors: Andong Li, Zhihang Sun, Fengyuan Hao, Xiaodong Li, Chengshi Zheng
Abstract: Speech enhancement (SE) and neural vocoding are traditionally viewed as separate tasks. In this work, we observe them under a common thread: the rank behavior of these processes. This observation prompts two key questions: Can a model designed for one task's rank degradation be adapted for the other? and Is it possible to address both tasks using a unified model? Our empirical findings demonstrate that existing speech enhancement models can be successfully trained to perform vocoding tasks, and a single model, when jointly trained, can effectively handle both tasks with performance comparable to separately trained models. These results suggest that speech enhancement and neural vocoding can be unified under a broader framework of speech restoration. Code: https://github.com/Andong‑Li‑speech/Neural‑Vocoders‑as‑Speech‑Enhancers.
Authors: Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li
Abstract: Text‑based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording. Current TSE techniques often focus on minimizing discrepancies between generated speech and reference within edited regions during training to achieve fluent TSE performance. However, the generated speech in the edited region should maintain acoustic and prosodic consistency with the unedited region and the original speech at both the local and global levels. To maintain speech fluency, we propose a new fluency speech editing scheme based on our previous FluentEditor model, termed FluentEditor2, by modeling the multi‑scale acoustic and prosody consistency training criterion in TSE training. Specifically, for local acoustic consistency, we propose a hierarchical local acoustic smoothness constraint to align the acoustic properties of speech frames, phonemes, and words at the boundary between the generated speech in the edited region and the speech in the unedited region. For global prosody consistency, we propose a contrastive global prosody consistency constraint to keep the speech in the edited region consistent with the prosody of the original utterance. Extensive experiments on the VCTK and LibriTTS datasets show that FluentEditor2 surpasses existing neural networks‑based TSE methods, including Editspeech, Campnet, A^3T, FluentSpeech, and our Fluenteditor, in both subjective and objective evaluations. Ablation studies further highlight the contributions of each module to the overall effectiveness of the system. Speech demos are available at: https://github.com/Ai‑S2‑Lab/FluentEditor2.
Authors: Pin-Jui Ku, Alexander H. Liu, Roman Korostik, Sung-Feng Huang, Szu-Wei Fu, Ante Jukić
Abstract: This paper proposes a generative pretraining foundation model for high‑quality speech restoration tasks. By directly operating on complex‑valued short‑time Fourier transform coefficients, our model does not rely on any vocoders for time‑domain signal reconstruction. As a result, our model simplifies the synthesis process and removes the quality upper‑bound introduced by any mel‑spectrogram vocoder compared to prior work SpeechFlow. The proposed method is evaluated on multiple speech restoration tasks, including speech denoising, bandwidth extension, codec artifact removal, and target speaker extraction. In all scenarios, finetuning our pretrained model results in superior performance over strong baselines. Notably, in the target speaker extraction task, our model outperforms existing systems, including those leveraging SSL‑pretrained encoders like WavLM. The code and the pretrained checkpoints are publicly available in the NVIDIA NeMo framework.
Authors: Yuhang Jia, Yang Chen, Jinghua Zhao, Shiwan Zhao, Wenjia Zeng, Yong Chen, Yong Qin
Abstract: Diffusion‑based text‑to‑audio (TTA) generation has made substantial progress, leveraging latent diffusion models (LDMs) to produce high‑quality, diverse, and instruction‑relevant audio. However, beyond generation, the task of audio editing remains equally important but has received comparatively little attention. Audio editing tasks face two primary challenges: executing precise edits and preserving the unedited sections. While workflows based on LDMs have effectively addressed these challenges in the field of image processing, similar approaches have been scarcely applied to audio editing. In this paper, we introduce AudioEditor, a training‑free audio editing framework built on the pretrained diffusion‑based TTA model. AudioEditor incorporates Null‑text Inversion and EOT‑suppression methods, enabling the model to preserve original audio features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high‑quality audio edits. Code and demo can be found at https://github.com/NKU‑HLT/AudioEditor.
Authors: Xiaoyu Liu, Xu Li, Joan Serrà, Santiago Pascual
Abstract: Speech restoration aims at restoring full‑band speech with high quality and intelligibility, considering a diverse set of distortions. MaskSR is a recently proposed generative model for this task. Like other models of its kind, MaskSR attains high quality but, as we show, intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre‑trained self‑supervised teacher model. Then, a masked language model is conditioned on the learned semantic features to predict acoustic tokens that encode low‑level spectral details of the target speech. We show that, with the same MaskSR model capacity and inference time, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves competitive word error rate among other models, while providing superior quality. An ablation study shows the effectiveness of various semantic representations.
Authors: Kai Li, Yi Luo
Abstract: Audio restoration has become increasingly significant in modern society, not only due to the demand for high‑quality auditory experiences enabled by advanced playback devices, but also because the growing capabilities of generative audio models necessitate high‑fidelity audio. Typically, audio restoration is defined as a task of predicting undistorted audio from damaged input, often trained using a GAN framework to balance perception and distortion. Since audio degradation is primarily concentrated in mid‑ and high‑frequency ranges, especially due to codecs, a key challenge lies in designing a generator capable of preserving low‑frequency information while accurately reconstructing high‑quality mid‑ and high‑frequency content. Inspired by recent advancements in high‑sample‑rate music separation, speech enhancement, and audio codec models, we propose Apollo, a generative model designed for high‑sample‑rate audio restoration. Apollo employs an explicit frequency band split module to model the relationships between different frequency bands, allowing for more coherent and higher‑quality restored audio. Evaluated on the MUSDB18‑HQ and MoisesDB datasets, Apollo consistently outperforms existing SR‑GAN models across various bit rates and music genres, particularly excelling in complex scenarios involving mixtures of multiple instruments and vocals. Apollo significantly improves music restoration quality while maintaining computational efficiency. The source code for Apollo is publicly available at https://github.com/JusperLee/Apollo.
Authors: Chuanbo Zhu, Wuyou Zhou, Rongxiu Zhong, Shilei Zhang, Kun Qian, Yike Guo, Wei Xue
Abstract: Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word‑level content modification and typically treat content, speaker, and emotion editing as separate tasks, limiting both editing granularity and flexibility. We propose UniSAE, a unified speech attribute editing framework which supports composable speaker, emotion and content editing from sub‑phoneme to word level within a single architecture. UniSAE introduces a Discrete Phonetic PosteriorGram (DPPG) representation that factorizes speech content into discrete tokens encoding phoneme identity, pronunciation variants, and duration, enabling direct phoneme‑ and sub‑phoneme‑level editing. For higher‑level modifications, an autoregressive content transformer predicts edited DPPG sequences for word‑level content editing. The edited sequences are rendered into speech by a diffusion‑based acoustic decoder, conditioned on disentangled speaker and emotion representations. Experimental results demonstrate that the proposed unified framework supports precise speaker and emotion control, content editing at multiple granularities, and joint modification of all three attributes within a single framework.
Authors: Buwaneka Epakanda, Athulya Ratnayake, Pandula Thennakoon, Mario De Silva, Avishka Ranasinghe, Roshan Godaliyadda, Parakrama Ekanayake
Abstract: Implicit Neural Representations (INRs) parameterized by multilayer perceptrons excel at modeling continuous signals. However, a key challenge persists as INRs fundamentally suffer from spectral bias and information cross‑talk. When a single network attempts to capture multi‑scale phenomena, high‑frequency weight updates destructively interfere with the underlying low‑frequency structural approximation. We introduce Scale and Learn INR (ScaLe‑INR), a novel multi‑branch architecture that resolves these limitations by explicitly matching the signal's frequency spectrum with the optimal operating region of the INR. Drawing upon the Fourier inverse scaling theorem, we demonstrate that applying directional coordinate scaling expands a network's representational bandwidth along specific spatial axes. To mathematically enforce functional disentanglement and minimize task‑specific information leakage between branches, we propose a Directional Edge Guidance Loss, a spatially‑conditioned sparsity prior derived from ground‑truth gradients. By constraining the high‑frequency branches to act as strict, localized edge‑filters, ScaLe‑INR eliminates spectral cross‑talk, accelerates convergence, and achieves high‑fidelity signal reconstruction on complex multi‑scale topologies. We evaluate ScaLe‑INR across diverse reconstruction and inverse tasks, demonstrating substantial performance gains over existing state‑of‑the‑art (SOTA) methods. The proposed architecture improves upon the nearest baselines by +5.16 dB in image reconstruction and +0.65 dB in image denoising. Furthermore, it achieves 50.02 dB on audio reconstruction and 0.999 IoU (Intersection over Union) on 3D reconstruction, outperforming all SOTA models.
Authors: Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang
Abstract: Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training‑based editing methods mainly rely on the local inductive biases and cross‑attention interaction in convolutional U‑Net backbones, which often hinder long‑range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two‑stage diffusion transformer architecture for instruction‑guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at low‑resolution stage, then switches to alternating joint‑attention and cross‑attention blocks to refine editing details at high‑resolution stage. This coarse‑to‑fine strategy enables efficient and accurate instruction‑guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
Authors: Yuxuan Jiang, Mingyang Han, Yusheng Dai, Andong Wang, Tianhong Zhou, Jiaxin Ye, Dongxiao Wang, Haoxiang Shi, Boyu Li, Jun Song, Cheng Yu, Bo Zheng, Weibei Dou, Zehua Chen, Jun Zhu
Abstract: Text‑to‑audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. Existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training‑free framework leveraging the state‑of‑the‑art Rectified Flow‑based TangoFlux model. FreeSonic utilizes an optimized inversion‑reverse process and joint text‑audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task‑oriented noise injection enhances versatility for tasks such as audio removal and non‑rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high‑fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free‑sonic.github.io/
Authors: Zhongyuan Fu
Abstract: We introduce AudEdit, an inversion‑free method for text‑guided editing of real audio with a pretrained rectified‑flow audio generator. Text‑to‑audio systems such as Stable Audio 3 already expose audio‑to‑audio editing by noising an input recording and denoising it under a new prompt, but this inversion‑style route must trade prompt adherence against preservation of rhythm, transients, timbre, and long‑range musical structure. Motivated by recent inversion‑free flow editing in computer vision, we develop an audio‑specific direct source‑to‑target ordinary differential equation for one‑dimensional Stable Audio 3 latents: at each flow step, we compare the target‑ and source‑conditioned velocity fields under a shared stochastic source marginal, and update the edited latent by their difference. The resulting editor requires no training, no paired edit data, no optimization, and no access to internal attention maps. Across sound‑effect and music editing sets built from FSD50K and the Song Describer Dataset, AudEdit improves CLAP text alignment and audio preservation over SDEdit, ODE inversion, and FireFlow; for example, on sound effects it raises target‑text CLAP similarity from 0.42 to 0.52 over the strongest baseline while reducing FAD from 65.70 to 50.37.
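Note: the per‑step update described above (comparing target‑ and source‑conditioned velocities at a shared noisy source point, then moving the edited latent by their difference) can be sketched on a toy problem. This is not the AudEdit implementation and does not call Stable Audio. The velocity field `toy_velocity` is an assumption: it is the exact rectified‑flow velocity for a model whose data is a single point `mu`, with the convention that t = 1 is noise and t = 0 is data. The function name `edit_without_inversion`, the step count, and the way the target‑side point is formed follow the general inversion‑free flow‑editing recipe from computer vision and are likewise assumptions.

```python
import random


def toy_velocity(z, t, mu):
    """Toy conditional velocity: with z_t = (1 - t) * x0 + t * noise and data
    fixed at `mu`, the rectified-flow velocity noise - x0 equals (z_t - mu) / t."""
    return [(zi - mi) / t for zi, mi in zip(z, mu)]


def edit_without_inversion(x_src, mu_src, mu_tar, num_steps=20, seed=0):
    """Integrate a direct source-to-target ODE without inverting x_src to noise."""
    rng = random.Random(seed)
    z_edit = list(x_src)                      # the edit starts at the source latent
    for k in range(num_steps):
        t = 1.0 - k / num_steps               # current time, never reaches 0 here
        t_next = 1.0 - (k + 1) / num_steps
        noise = [rng.gauss(0.0, 1.0) for _ in x_src]
        # Noisy source point drawn from the source marginal at time t.
        z_src = [(1 - t) * x + t * n for x, n in zip(x_src, noise)]
        # Target-side point shares the same noise, shifted by the edit so far.
        z_tar = [ze + zs - x for ze, zs, x in zip(z_edit, z_src, x_src)]
        v_tar = toy_velocity(z_tar, t, mu_tar)    # target-conditioned velocity
        v_src = toy_velocity(z_src, t, mu_src)    # source-conditioned velocity
        # Euler step on the velocity difference.
        z_edit = [ze + (t_next - t) * (vt - vs)
                  for ze, vt, vs in zip(z_edit, v_tar, v_src)]
    return z_edit


source = [1.0, -2.0]      # toy "source latent", equal to the source mode
target_mode = [3.0, 0.5]  # toy "target prompt" mode
edited = edit_without_inversion(source, source, target_mode)
print(edited)
```

In this toy setting the shared noise cancels in the velocity difference, so the edited latent travels from the source point to the target mode without ever reconstructing the source from noise. With a real generator the cancellation is only partial, which is why the noisy source point is resampled at each step.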
Authors: Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao
Abstract: Text‑guided audio editing aims to modify the language‑specified acoustic content while preserving edit‑irrelevant source components. Existing training‑free methods typically rely on inversion‑based editing. While inversion‑free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source‑to‑target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training‑free and inversion‑free method for audio editing. Experiments on music and event‑level benchmarks across two backbones show that DirectAudioEdit reduces macro‑averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.
Authors: Yinan Chen, Chuming Lin, Zhennan Chen, Yuxiang Zeng, Junwei Zhu, Yali Bi, Xijie Huang, Chengming Xu, Donghao Luo, Zhucun Xue, Xiaobin Hu, Chengjie Wang, Yong Liu, Jiangning Zhang, Shuicheng Yan
Abstract: While instruction‑based video editing has seen significant progress, joint audio‑visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit‑100k, the first large‑scale, high‑quality dataset tailored for instruction‑guided joint audio‑visual editing. Focusing on human‑centric videos, JAVEdit‑100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent‑in‑the‑loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human‑aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction‑guided joint audio‑visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics.
Authors: Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Linqi Song
Abstract: Instruction‑guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi‑attribute benchmark for instruction‑guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor‑based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, leading to three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed‑source Speech LLMs generally outperform open‑source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework to identify bottlenecks in Speech LLMs, thereby facilitating the development of next‑generation Speech LLMs with more robust and precise instruction‑guided editing capabilities. Data and code are available at https://github.com/daxintan‑cuhk/SpeechEditBench.
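Note: the relationship between the three metrics can be illustrated with a minimal sketch. The abstract does not define how the metrics are computed, so this assumes each edited sample receives two boolean judgments (target attribute changed as instructed; untargeted attributes preserved) and that joint success requires both. The function name `editing_success_rates` and the sample data are hypothetical.

```python
def editing_success_rates(results):
    """Aggregate per-sample (target_ok, preserved_ok) booleans into three rates.

    Assumption: joint success means both judgments hold for the same sample.
    """
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    target = sum(1 for t, _ in results if t) / n
    preservation = sum(1 for _, p in results if p) / n
    joint = sum(1 for t, p in results if t and p) / n
    return {"target": target, "preservation": preservation, "joint": joint}


# Hypothetical judgments for four edited utterances.
judgments = [(True, True), (True, False), (False, True), (True, True)]
rates = editing_success_rates(judgments)
print(rates)
```

Under this assumption the joint rate can never exceed either of the other two, which is one way a model can look strong on target success alone and still score low on joint success.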
Authors: Zhaoqing Li, Haoning Xu, Jingran Su, Yaofang Liu, Zhefan Rao, Huimeng Wang, Jiajun Deng, Tianzi Wang, Zengrui Jin, Rui Liu, Haoxuan Che, Xunying Liu
Abstract: We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text‑to‑audio, text‑to‑speech, zero‑shot speaker cloning, mixed speech‑and‑sound generation, scene‑level audio editing, speech‑in‑scene editing, and timed temporal composition, all of which share a single set of weights. Our architecture features two core designs: (1) Layer‑wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM‑DiT blocks via learned projections, providing depth‑matched semantic conditioning that improves instruction following over single‑layer baselines; and (2) a unified multi‑task architecture where task identity is encoded solely by a channel‑wise mask and source audio is provided through VAE‑encoded channel concatenation. Training is stabilized by an online GPU‑side multi‑task data synthesis pipeline with task‑homogeneous batching and a two‑stage curriculum. With 621M to 732M trainable parameters, UNISON achieves results competitive with or exceeding task‑specialist models across evaluated domains, while being roughly 4× smaller than comparable unified systems.
Authors: Chih-Heng Chang, Keng-Seng Ho, Chih-Yu Tsai, Kuan-Lin Chen, Yi-Hsuan Yang, Jian-Jiun Ding
Abstract: Controllable music editing aims to modify high‑level attributes while strictly preserving rhythmic and melodic structures. However, this task is challenged by a semantic‑structural entanglement: steering methods often degrade structure to achieve editing performance, while structural adaptors suppress semantic responsiveness. We propose AnchorSteer, a framework that disentangles this tension by coupling structural anchoring with self‑discovered semantic steering. The proposed approach probes internal representations to extract interpretable, label‑free concept vectors via a self‑supervised reconstruction objective, isolating attributes without curated data. During editing, these portable, plug‑and‑play concept vectors are injected into diffusion hidden manifolds while a structural adaptor enforces consistency. Variants for unconditioned and conditioned injections are provided to balance robustness and semantic strength. Experiments on ZoME‑Bench and subjective tests show that the proposed framework outperforms both steering‑only and anchoring‑only baselines, enabling significant semantic transformations with high‑fidelity structural preservation.
Authors: Nelly Garcia, Joshua Reiss
Abstract: Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed‑methods study comprising a survey of 76 practitioners and follow‑up semi‑structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast‑consumption media contexts but lack the narrative sophistication required for high‑end sound design (films, immersive experiences, etc.). Practitioners demonstrate a preference for assistive, task‑specific applications, particularly in audio restoration and library management, over end‑to‑end generative systems. This work contributes to the ongoing discussion on the use of AI and AI‑enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendations for sound technologists and developers, based on our findings, to guide the development of more informed AI tools for sound design.
Authors: Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan
Abstract: Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame‑level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token‑space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non‑temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low‑bitrate speech coding settings while enabling simple token‑space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task‑specific editing models.
Authors: Haowen Li, Tianxiang Li, Yi Yang, Boyu Cao, Qi Liu
Abstract: The advancement of diffusion‑based text‑to‑music generation has opened new avenues for zero‑shot music editing. However, existing methods fail to achieve stem‑specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real‑world production necessitates precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross‑attention captures semantic features of stems, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To resolve this dilemma, we propose Polyphonia, a zero‑shot editing framework with Acoustic‑Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling precise semantic synthesis while non‑target stems are preserved. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 timbre transfer tasks in polyphonic music. Specifically, Polyphonia achieves an increase of 15.5% in target alignment compared to baselines, while maintaining competitive music fidelity and non‑target integrity.
Authors: Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran
Abstract: Recent advances in voice cloning and text‑to‑speech synthesis have made partial speech manipulation ‑ where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity ‑ an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance‑level binary classification or single‑region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large‑scale multilingual dataset spanning 6 languages with 1‑3 independently inpainted word‑level segments per utterance, generated via LLM‑guided semantic replacement and neural voice cloning, with fake content constituting only 2‑7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone‑agnostic framework that performs coarse‑to‑fine sliding‑window classification with gap‑tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment‑level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero‑shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance‑level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2‑7% of content is manipulated. ISA consistently outperforms non‑iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
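A segment-level F1 based on temporal IoU matching, as SF1@tau is described in this abstract, might look roughly like the sketch below. The greedy one-to-one matching and the handling of empty inputs are assumptions made for illustration; the paper's exact matching rule is not given in the abstract, and `temporal_iou` and `segment_f1` are hypothetical names.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred, ref, tau=0.5):
    """F1 over segments, counting a prediction as correct if it is matched
    one-to-one to a reference segment with IoU >= tau (greedy, best IoU first)."""
    if not pred and not ref:
        return 1.0  # assumption: nothing to find and nothing predicted is perfect
    pairs = sorted(
        ((temporal_iou(p, r), i, j)
         for i, p in enumerate(pred)
         for j, r in enumerate(ref)),
        reverse=True,
    )
    used_pred, used_ref = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < tau:
            break  # pairs are sorted, so no later pair can qualify
        if i in used_pred or j in used_ref:
            continue
        used_pred.add(i)
        used_ref.add(j)
        true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because unmatched predictions lower precision and unmatched references lower recall, a score of this form penalizes both a wrong region count and poor localization, which is the joint behavior the abstract attributes to the metric.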
Authors: Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang
Abstract: Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer‑based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non‑uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform‑length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano‑roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event‑based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long‑range patterns with the proposed tokenization.
Authors: Yoonmin Cha, Dawit Chun, Sung Park
Abstract: Brain‑computer interfaces (BCIs) for speech restoration hold transformative potential for the approximately 173,000 to 232,500 individuals worldwide with ALS‑related dysarthria. Despite recent progress, high‑performance speech BCIs have been demonstrated in only 22 to 31 patients globally, largely due to limitations in neural decoding accuracy and practical input interfaces. We present iPhoneme, a brain‑to‑text communication system that jointly addresses these challenges through integrated modeling and interaction design. The system combines a deep learning phoneme decoder based on a modified Conformer architecture (ConformerXL, 192.9M parameters) with a gaze‑assisted phoneme input interface that mitigates the Midas touch problem in eye‑tracking systems. The acoustic model incorporates a temporal prenet with multi‑scale dilated convolutions and bidirectional GRU for neural jitter correction, temporal subsampling for CTC stability, and Pre‑RMSNorm stabilization across 12 encoder blocks, trained with AdamW and cosine scheduling. On the interaction side, iPhoneme introduces a chorded gaze‑plus‑silent‑speech paradigm that replaces dwell‑time selection, enabling more efficient input. We evaluate the system on the T15 dataset (45 sessions, 8,071 trials) of 256‑channel intracranial EEG from speech motor cortex regions. A 6‑gram phoneme language model trained on 3.1M sequences, combined with WFST beam search (beam=128), achieves 92.14% phoneme accuracy (7.86% PER) and 73.39% word accuracy (26.61% WER), approximately 3% above prior state‑of‑the‑art. The system operates on CPU with 180 ms latency, demonstrating real‑time, high‑accuracy brain‑to‑text communication for ALS.
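The paired figures in this abstract follow the usual convention that accuracy is 100% minus the error rate (92.14 + 7.86 and 73.39 + 26.61 both sum to 100). As a reminder of how such error rates are conventionally obtained, here is a minimal sketch of the standard edit-distance computation; it is not the authors' evaluation code, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences
    (substitutions, insertions, and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r
                cur[j - 1] + 1,          # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length (WER for words, PER for phonemes)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())` counts one substitution and one deletion against six reference words.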
Authors: Sihan Lv, Yechen Jin, Zhen Li, Jintao Chen, Jinshan Zhang, Ying Li, Jianwei Yin, Meng Xi
Abstract: Text‑based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task‑specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text‑to‑Speech (TTS) models often faces a trade‑off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training‑free precise speech editing framework. Leveraging a pre‑trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a mel‑space guidance signal, enforcing structural constraints only where necessary without disrupting the generative manifold. To fill the gap of publicly accessible benchmarks, we introduce LibriSpeech‑Edit, a new and larger speech editing dataset. As existing metrics poorly evaluate temporal consistency in unedited regions, we propose Word‑level Dynamic Time Warping (WDTW). Extensive experiments demonstrate that AST resolves the controllability‑quality trade‑off without extra training. Compared to the previous most temporally consistent baseline, AST improves consistency while reducing Word Error Rate by nearly 70%. Moreover, applying AST to a foundation TTS model reduces WDTW by 27%, achieving state‑of‑the‑art speaker preservation and temporal fidelity.
Authors: Mingda Han, Huanqi Yang, Chaoqun Li, Wenhao Li, Guoming Zhang, Yanni Yang, Yetong Cao, Weitao Xu, Pengfei Hu
Abstract: Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar‑sensed throat vibrations. VoxAnchor uses contactless millimeter‑wave radar to capture fine‑grained throat vibrations that are tightly coupled with human speech production, establishing a hard‑to‑forge anchor rooted in human physiology. The design comprises three main components: (1) a cross‑modal framework that uses modality‑specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase‑aware pipeline that extracts physically consistent, temporally faithful throat vibrations; and (3) a dual‑stage strategy that combines signal‑level onset detection and semantic‑level coherence to align asynchronous radar and audio streams. Unlike liveness detection, which only confirms whether speech occurred, VoxAnchor verifies what was spoken through word‑level content consistency, exposing localized edits that preserve identity and global authenticity cues. Extensive evaluations show that VoxAnchor achieves robust, fine‑grained detection across diverse forgeries (editing, splicing, replay, deepfake) and conditions, with an overall EER of 0.017, low latency, and modest computational cost.
Authors: Xiaoyu Fan, Huizhi Xie, Wei Zou, Yunzhang Chen
Abstract: Large language model (LLM)‑based text‑to‑speech (TTS) systems achieve remarkable naturalness via autoregressive (AR) decoding, but require N sequential steps to generate N speech tokens. We present LLaDA‑TTS, which replaces the AR LLM with a masked diffusion model that completes generation in a fixed number of parallel steps, decoupling inference latency from sequence length. Remarkably, using only 50 hours of fine‑tuning data, we successfully transfer a pretrained AR checkpoint to the masked diffusion paradigm via bidirectional attention. At 64 steps, LLaDA‑TTS achieves 0.98% CER (zh) and 1.96% WER (en) on Seed‑TTS‑Eval, matching the original CosyVoice 3 baseline performance while delivering a 2x LLM‑stage speedup, a notable acceleration achieved despite the absence of a KV cache, an optimization the AR baseline heavily relies on. Beyond acceleration, the bidirectional architecture naturally enables zero‑shot speech editing (including word‑level insertion, deletion, and substitution) without any additional training. Theoretically, we prove that AR‑pretrained weights are near‑optimal for bidirectional masked prediction under the locality property of acoustic tokens, explaining this rapid convergence. This general method modifies only the attention mask and objective, applying seamlessly to any LLM‑based AR TTS system. Code and audio samples will be available at https://deft‑piroshki‑b652b5.netlify.app/.
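To make the step-count contrast concrete: autoregressive decoding needs one step per token, whereas a masked diffusion decoder commits several tokens per step over a fixed number of steps. The sketch below uses an even unmasking schedule, which is purely an assumption for illustration; the abstract does not describe the schedule LLaDA‑TTS actually uses, and `unmask_schedule` is a hypothetical helper.

```python
import math


def unmask_schedule(num_tokens, num_steps):
    """Number of tokens committed at each parallel step under a
    hypothetical even schedule (not taken from the paper)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    remaining = num_tokens
    schedule = []
    for step in range(num_steps):
        # Spread what is left evenly over the steps that remain.
        k = math.ceil(remaining / (num_steps - step))
        schedule.append(k)
        remaining -= k
    return schedule


def autoregressive_steps(num_tokens):
    """An AR decoder emits one token per sequential step."""
    return num_tokens
```

With 500 speech tokens, `autoregressive_steps(500)` is 500, while `len(unmask_schedule(500, 64))` stays at 64 regardless of sequence length; the per-step cost differs between the two paradigms, so step counts alone do not determine wall-clock speedup.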
Authors: Xintao Hu, Feng-Qi Cui
Abstract: With the emergence of AI techniques for depression diagnosis, the conflict between high demand and limited supply for depression screening has been significantly alleviated. Among various modal data, audio‑based depression diagnosis has received increasing attention from both academia and industry since audio is the most common carrier of emotion transmission. Unfortunately, audio data also contains User‑sensitive Identity Information (ID), which is extremely vulnerable and may be maliciously used during the smart diagnosis process. In previous methods, separating depression features from sensitive features has always been a barrier. The problem also calls for a safe encryption methodology that encrypts only the sensitive features, together with a powerful classifier that can correctly diagnose depression. To tackle these challenges, we leverage adversarial loss‑based Subspace Decomposition and propose TAAC (Trustable Audio Affective Computing), a first practical framework that performs automated depression detection from audio within a trustable environment. The key enablers of TAAC are Differentiating Features Subspace Decompositor (DFSD), Flexible Noise Encryptor (FNE) and Staged Training Paradigm, used for decomposition, ID encryption and performance enhancement, respectively. Extensive experiments with existing encryption methods demonstrate our framework's preeminent performance in depression detection, ID reservation and audio reconstruction. Meanwhile, experiments across various settings demonstrate our model's stability under different encryption strengths, indicating our framework's excellence in Confidentiality, Accuracy, Traceability, and Adjustability.
Authors: Sung Kyun Chung, Jiaheng Dong, Qiuchi Hu, Gongping Huang, Hong Jia, Ting Dang
Abstract: Large Audio‑Language Models (LALMs) have shown strong performance in speech understanding, making speech a natural interface for accessing factual information. Yet they are trained on static corpora and may encode incorrect facts. Existing model editing methods localize and update facts in text‑only LLMs, but do not account for continuous speech representations or for where knowledge is stored across the acoustic, language, and cross‑modal modules. We construct the first audio benchmark for knowledge localization and editing in LALMs and propose a speech‑driven locate‑then‑edit framework. First, we use speech‑aware causal tracing to localize layers and modules that support factual retrieval and then apply editing at identified sites. Experiments show that factual knowledge is jointly encoded in audio and text modules, and that audio editing yields more effective updates than text editing or fine‑tuning, enabling fine‑grained knowledge control in speech AI systems.
Authors: Shree Harsha Bokkahalli Satish, Harm Lameris, Joakim Gustafson, Éva Székely
Abstract: Audio anti‑spoofing systems are typically formulated as binary classifiers distinguishing bona fide from spoofed speech. This assumption fails under layered generative processing, where benign transformations introduce distributional shifts that are misclassified as spoofing. We show that phonation‑modifying voice conversion and speech restoration are treated as out‑of‑distribution despite preserving speaker authenticity. Using a multi‑class setup separating bona fide, converted, spoofed, and converted‑spoofed speech, we analyse model behaviour through self‑supervised learning (SSL) embeddings and acoustic correlates. The benign transformations induce a drift in the SSL space, compressing bona fide and spoofed speech and reducing classifier separability. Reformulating anti‑spoofing as a multi‑class problem improves robustness to benign shifts while preserving spoof detection, suggesting binary systems model the distribution of raw speech rather than authenticity itself.
Authors: Yongjoon Lee, Jung-Woo Choi
Abstract: General speech restoration demands techniques that can interpret complex speech structures under various distortions. While State‑Space Models like SEMamba have advanced the state‑of‑the‑art in speech denoising, they are not inherently optimized for critical speech characteristics, such as spectral periodicity or multi‑resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech‑specific features as inductive biases. In particular, we propose Frequency GLP, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. Then, we design a multi‑resolution parallel time‑frequency dual‑processing block to capture diverse spectral patterns, and a learnable mapping to further enhance model performance. With all our ideas combined, the proposed SEMamba++ achieves the best performance among multiple baseline models while remaining computationally efficient.
Authors: Heinrich Dinkel, Xingwei Sun, Gang Li, Jiahao Mei, Yadong Niu, Jizhong Liu, Xiyang Li, Yifan Liao, Jiahao Zhou, Junbo Zhang, Jian Luan
Abstract: This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text‑to‑audio (TTA), text‑to‑music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)‑based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general‑purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE‑based architectures are a prerequisite for audio synthesis. Checkpoints are available at https://huggingface.co/mispeech/dashengtokenizer.
Authors: Trung Dang, Sharath Rao, Ananya Gupta, Christopher Gagne, Panagiotis Tzirakis, Alice Baird, Jakub Piotr Cłapa, Peter Chin, Alan Cowen
Abstract: Modern Text‑to‑Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high‑fidelity, zero‑shot generation. However, these systems typically rely on fixed‑frame‑rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one‑to‑one synchronization between continuous acoustic features and text tokens, enabling unified, single‑stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high‑fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text‑only guidance, a technique that blends logits from text‑only and text‑speech modes to flexibly bridge the gap toward text‑only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state‑of‑the‑art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
Authors: Yuancheng Wang, Zhenyu Tang, Yun Wang, Arthur Hinsvark, Yingru Liu, Yinghao Li, Kainan Peng, Junyi Ao, Mingbo Ma, Mike Seltzer, Qing He, Xubo Liu
Abstract: Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade‑offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic‑rich representations through supervised learning and enables high‑fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction and generation tasks, at an extremely low token rate of 12.5 Hz and a bit‑rate of 200 bits‑per‑second.
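The two rates quoted at the end of this abstract imply a fixed information budget per token, which is simple arithmetic. A small sketch follows; the interpretation of that budget as a single 65,536-entry vocabulary is only one possibility and is an assumption, since the abstract does not say how the bits are allocated.

```python
def bits_per_token(bitrate_bps, token_rate_hz):
    """Average number of bits carried by each token."""
    return bitrate_bps / token_rate_hz


# Figures taken from the abstract: 200 bits per second at 12.5 tokens per second.
budget = bits_per_token(200, 12.5)  # 16.0 bits per token

# Assumption for illustration: if all 16 bits indexed one flat codebook,
# that codebook would have 2**16 entries.
equivalent_vocab = 2 ** int(budget)
```

The same function also gives the token count for a clip: ten seconds of speech at 12.5 Hz is 125 tokens, or 2,000 bits in total.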
Authors: Kaiyuan Zhang, Mohan Shi, Eray Eren, Natarajan Balaji Shankar, Zilai Wang, Abeer Alwan
Abstract: Neural audio codecs are widely used for audio compression and can be integrated into token‑based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation, but this often degrades reconstruction performance, making it difficult to achieve both. To address this limitation, we introduce STACodec, a unified codec that integrates semantic information from self‑supervised learning (SSL) models into the first layer of residual vector quantization (RVQ‑1) via semantic token assignment (STA). To further eliminate reliance on SSL‑based semantic tokenizers and improve efficiency during inference, we propose a semantic pre‑distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference. Experimental results show that STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.
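Residual vector quantization (RVQ) quantizes a vector in layers, with each layer encoding what the previous layers left over. The sketch below shows plain RVQ plus an optional externally supplied index for the first layer, which is one loose reading of "semantic token assignment"; how STACodec actually performs the assignment is not specified in the abstract, and the `first_index` parameter and all function names are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def rvq_encode(vec, codebooks, first_index=None):
    """Encode vec layer by layer; each layer quantizes the remaining residual.
    If first_index is given, layer 0 uses it instead of a nearest-neighbor
    search (illustrative stand-in for an externally assigned semantic token)."""
    residual = list(vec)
    indices = []
    for layer, codebook in enumerate(codebooks):
        if layer == 0 and first_index is not None:
            idx = first_index
        else:
            idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

When the first index is forced, the later layers still quantize whatever residual that choice leaves behind, so they can compensate for a first-layer entry that is not the acoustically nearest one.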
Authors: Haina Zhu, Yao Xiao, Xiquan Li, Ziyang Ma, Jianwei Yu, Bowen Zhang, Mingqi Yang, Xie Chen
Abstract: We study the fine‑grained text‑to‑audio (T2A) generation task. While recent models can synthesize high‑quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for specific control types, we propose to train ControlNet models on top of pre‑trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A‑ControlNet and T2A‑Adapter, and show that the T2A‑Adapter model offers a more efficient structure with strong control ability. With only 38M additional parameters, T2A‑Adapter achieves state‑of‑the‑art performance on the AudioSet‑Strong in both event‑level and segment‑level F1 scores. We further extend this framework to audio editing, proposing T2A‑Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.
Authors: Yong Ren, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Tao Wang
Abstract: Imperceptible text‑based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content‑style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of "Edit Content, Preserve Acoustics". Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self‑Consistency Rewards Group Relative Policy Optimization. By leveraging a pre‑trained Text‑to‑Speech model as an implicit critic, complemented by strict intelligibility and duration constraints, we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state‑of‑the‑art autoregressive and non‑autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.
Authors: Jun Xue, Yi Chai, Yanzhen Ren, Jinshen He, Zhiqiang Tang, Zhuolin Yi, Yihuan Huang, Yuankun Xie, Yujie Chen
Abstract: Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulting in restricted diversity and poor coverage of realistic editing scenarios. Meanwhile, current SED methods rely heavily on frame‑level supervision to detect observable acoustic anomalies, which fundamentally limits their ability to handle deletion‑type edits, where the manipulated content is entirely absent from the signal. To address these challenges, we present a unified framework that bridges speech editing detection and content localization through a generative formulation based on Audio Large Language Models (Audio LLMs). We first introduce AiEdit (https://huggingface.co/datasets/JunXueTech/AiEdit), a large‑scale bilingual dataset (approximately 140 hours) that covers addition, deletion, and modification operations using state‑of‑the‑art end‑to‑end speech editing systems, providing a more realistic benchmark for modern threats. Building upon this, we reformulate SED as a structured text generation task, enabling joint reasoning over edit type identification and content localization. To enhance the grounding of generative models in acoustic evidence, we propose a prior‑enhanced prompting strategy that injects word‑level probabilistic cues derived from a frame‑level detector. Furthermore, we introduce an acoustic consistency‑aware loss that explicitly enforces the separation between normal and anomalous acoustic representations in the latent space. Experimental results demonstrate that the proposed approach consistently outperforms existing methods across both detection and localization tasks.
Authors: Peter Balušík, Pavel Rajmic
Abstract: We address the problem of time‑frequency audio inpainting, where the goal is to fill missing spectrogram portions with reliable information. Despite recent advances, existing approaches still face limitations in both reconstruction quality and computational efficiency. To bridge this gap, we propose a method that utilizes a phase‑aware signal prior which exploits estimates of the instantaneous frequency. An optimization problem is formulated and solved using the generalized Chambolle‑Pock algorithm. The proposed method is evaluated against other time‑frequency inpainting methods, specifically a deep‑prior audio inpainting neural network and the autoregression‑based approach known as Janssen‑TF. Our proposed approach surpassed these methods by a large margin in the objective evaluation as well as in the conducted subjective listening test, improving the state of the art. In addition, the reconstructions are obtained with a substantially reduced computational cost compared to alternative methods.
Authors: Carlos Hernandez-Olivan, Hendrik Vincent Koops, Hao Hao Tan, Elio Quinton
Abstract: Audio restoration consists in inverting degradations of a digital audio signal to recover what would have been the pristine quality signal before the degradation occurred. This is valuable in contexts such as archives of music recordings, particularly those of precious historical value, for which a clean version may have been lost or simply does not exist. Recent work applied generative models to audio restoration, showing promising improvement over previous methods, and opening the door to the ability to perform restoration operations that were not possible before. However, making these models finely controllable remains a challenge. In this paper, we propose an extension of FLowHigh and introduce the Dynamic Spectral Contour (DSC) as a control signal for bandwidth extension via classifier‑free guidance. Our experiments show competitive model performance, and indicate that DSC is a promising feature to support fine‑grained conditioning.
Authors: Naqcho Ali Mehdi, Mohammad Adeel, Aizaz Ali Larik
Abstract: We present SoundPlot, an open‑source framework for analyzing avian vocalizations through acoustic feature extraction, dimensionality reduction, and neural audio synthesis. The system transforms audio signals into a multi‑dimensional acoustic feature space, enabling real‑time visualization of temporal dynamics in 3D using web‑based interactive graphics. Our framework implements a complete analysis‑synthesis pipeline that extracts spectral features (centroid, bandwidth, contrast), pitch contours via probabilistic YIN (pYIN), and mel‑frequency cepstral coefficients (MFCCs), mapping them to a unified timbre space for visualization. Audio reconstruction employs the Griffin‑Lim phase estimation algorithm applied to mel spectrograms. The accompanying Three.js‑based interface provides dual‑viewport visualization comparing original and synthesized audio trajectories with independent playback controls. We demonstrate the framework's capabilities through comprehensive waveform analysis, spectrogram comparisons, and feature space evaluation using Principal Component Analysis (PCA). Quantitative evaluation shows mel spectrogram correlation scores exceeding 0.92, indicating high‑fidelity preservation of perceptual acoustic structure. SoundPlot is released under the MIT License to facilitate research in bioacoustics, audio signal processing, and computational ethology.
Authors: Jing Zhang, Bingjie Fan
Abstract: Existing image emotion editing methods struggle to disentangle emotional cues from latent content representations, often yielding weak emotional expression and distorted visual structures. To bridge this gap, we propose EmoKGEdit, a novel training‑free framework for precise and structure‑preserving image emotion editing. Specifically, we construct a Multimodal Sentiment Association Knowledge Graph (MSA‑KG) to disentangle the intricate relationships among objects, scenes, attributes, visual clues and emotion. MSA‑KG explicitly encodes the object‑attribute‑emotion causal chain and serves as external knowledge to support chain‑of‑thought reasoning, guiding the multimodal large model to infer plausible emotion‑related visual cues and generate coherent instructions. In addition, based on MSA‑KG, we design a disentangled structure‑emotion editing module that explicitly separates emotional attributes from layout features within the latent space, which ensures that the target emotion is effectively injected while strictly maintaining visual spatial coherence. Extensive experiments demonstrate that EmoKGEdit achieves excellent performance in both emotion fidelity and content preservation, and outperforms the state‑of‑the‑art methods.
Authors: Diqiong Jiang, Kai Zhu, Dan Song, Jian Chang, Chenglizhao Chen, Zhenyu Wu
Abstract: Speech‑driven 3D facial animation aims to generate realistic and expressive facial motions directly from audio. While recent methods achieve high‑quality lip synchronization, they often rely on discrete emotion categories, limiting continuous and fine‑grained emotional control. We present EditEmoTalk, a controllable speech‑driven 3D facial animation framework with continuous emotion editing. The key idea is a boundary‑aware semantic embedding that learns the normal directions of inter‑emotion decision boundaries, enabling a continuous expression manifold for smooth emotion manipulation. Moreover, we introduce an emotional consistency loss that enforces semantic alignment between the generated motion dynamics and the target emotion embedding through a mapping network, ensuring faithful emotional expression. Extensive experiments demonstrate that EditEmoTalk achieves superior controllability, expressiveness, and generalization while maintaining accurate lip synchronization. Code and pretrained models will be released.
Authors: Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yaxin Han, Mengying Feng, Yong Qin
Abstract: Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems suffer from complex preprocessing pipelines and a reliance on explicit external temporal alignment. Addressing these limitations, we propose CosyEdit, an end‑to‑end speech editing model adapted from CosyVoice through task‑specific fine‑tuning and an optimized inference procedure, which internalizes speech‑text alignment while ensuring high consistency between the speech before and after editing. By fine‑tuning on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M‑parameter model achieves reliable speech editing performance. Experiments on the RealEdit benchmark indicate that CosyEdit not only outperforms several billion‑parameter language model baselines but also matches the performance of state‑of‑the‑art cascade approaches. These results demonstrate that, with task‑specific fine‑tuning and inference optimization, robust and efficient speech editing capabilities can be unlocked from a zero‑shot TTS model, yielding a novel and cost‑effective end‑to‑end solution for high‑quality speech editing.
Authors: Ye Tao, Wen Wu, Chao Zhang, Mengyue Wu, Shuai Wang, Xuenan Xu
Abstract: Text‑guided audio editing aims to modify specific acoustic events while strictly preserving non‑target content. Despite recent progress, existing approaches remain fundamentally limited. Training‑free methods often suffer from signal degradation caused by diffusion inversion, while training‑based methods, although achieving higher generation quality, are severely constrained by the scarcity of high‑quality paired data and task formulations that cover only a narrow subset of editing operations. In addition, standard architectures typically decouple text and audio processing, limiting the ability to align instructions with specific acoustic contexts.
To address these challenges, we propose MMEdit, an audio‑language‑model‑driven framework for unified audio editing. We systematically extend task definitions to cover a comprehensive range of editing operations, including addition, replacement, removal, reordering, and attribute modification. Furthermore, we design a scalable data synthesis pipeline to construct large‑scale paired datasets with fine‑grained event‑level annotations. To capture complex editing semantics, we integrate a Qwen2‑Audio encoder with an MMDiT‑based generator, enabling precise cross‑modal alignment and localized editing.
Experimental results demonstrate that our method achieves superior editing localization accuracy, robust instruction following, and high fidelity in non‑edited regions.
Authors: Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Xiaofu Chen, Bin Gong, Zheng Xue, Gang Song
Abstract: Many existing audio processing and generation models rely on task‑specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework capable of handling multiple tasks, while providing robust instruction and audio understanding and high‑quality audio generation. This requires a compatible paradigm design, a powerful backbone, and a high‑fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder‑only autoregressive (AR) LM‑based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H‑Codec, which incorporates self‑supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H‑Codec, such as a dynamic frame‑rate mechanism and extending the audio sampling rate to 48 kHz. QuarkAudio unifies tasks by using task‑specific conditional information as the conditioning sequence of the decoder‑only LM, and predicting discrete target audio tokens in an AR manner. The framework supports a wide range of audio processing and generation tasks, including speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language‑queried audio source separation (LASS). In addition, we extend downstream tasks to universal free‑form audio editing guided by natural language instructions (including speech semantic editing and audio event editing). Experimental results show that H‑Codec achieves high‑quality audio reconstruction with a low frame rate, improving both the efficiency and performance of downstream audio generation, and that QuarkAudio delivers competitive or comparable performance to state‑of‑the‑art task‑specific or multi‑task systems across multiple tasks.
Authors: Simon Welker, Bunlong Lay, Maris Hillemann, Tal Peer, Timo Gerkmann
Abstract: Diffusion‑based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real‑time communication is, however, still lagging behind due to their computation‑heavy nature involving multiple calls of large DNNs.
Here, we present Stream.FM, a frame‑causal flow‑based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real‑time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few‑step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task.
Our work looks beyond theoretical latencies, showing that high‑quality streaming generative speech processing can be realized on consumer GPUs available today. Stream.FM can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post‑filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, Stream.FM establishes a state‑of‑the‑art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non‑streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency.
Authors: Yash Vishe, Eric Xue, Xunyi Jiang, Zachary Novack, Junda Wu, Julian McAuley, Xin Xu
Abstract: Music editing plays a vital role in modern music production, with applications in film, broadcasting, and game development. Recent advances in music generation models have enabled diverse editing tasks such as timbre transfer, instrument substitution, and genre transformation. However, many existing works overlook the evaluation of their ability to preserve musical facets that should remain unchanged during editing, a property we define as Music Context Preservation (MCP). While some studies do consider MCP, they adopt inconsistent evaluation protocols and metrics, leading to unreliable and unfair comparisons. To address this gap, we introduce the first MCP evaluation benchmark, MuseCPBench, which covers four categories of musical facets and enables comprehensive comparisons across five representative music editing baselines. Through systematic analysis along musical facets, methods, and models, we identify consistent preservation gaps in current music editing methods and provide insightful explanations. We hope our findings offer practical guidance for developing more effective and reliable music editing strategies with strong MCP capability.
Authors: Andreas Papageorgiou, Paulo Vitor Itaborai, Kostas Blekos, Karl Jansen
Abstract: This paper introduces quantum circuit methodologies for pointwise multiplication and convolution of complex functions, conceptualized as "processing through encoding". Leveraging known techniques, we describe an approach where multiple complex functions are encoded onto auxiliary qubits. Applying the proposed scheme for two functions f and g, their pointwise product f(x)g(x) is shown to naturally form as the coefficients of part of the resulting quantum state. Adhering to the convolution theorem, we then demonstrate how the convolution f * g can be constructed. Similarly to related work, this involves the encoding of the Fourier coefficients \mathcal{F}[f] and \mathcal{F}[g], which facilitates their pointwise multiplication, followed by the inverse Quantum Fourier Transform. We discuss the simulation of these techniques, their integration into an extended quantumaudio package for audio signal processing, and present initial experimental validations. This work offers a promising avenue for quantum signal processing, with potential applications in areas such as quantum‑enhanced audio manipulation and synthesis.
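The convolution theorem that the abstract relies on can be checked with a purely classical sketch. The code below is not the paper's quantum circuit and does not use the quantumaudio package; it is a plain‑Python analogue in which a naive discrete Fourier transform stands in for the Quantum Fourier Transform. The helper names (`dft`, `circular_convolution_direct`, `circular_convolution_fourier`) and the non‑unitary normalization are assumptions made for illustration.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (forward unnormalized, inverse divides by n)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_convolution_direct(f: Sequence[complex], g: Sequence[complex]) -> List[complex]:
    """Circular convolution computed from its definition."""
    n = len(f)
    return [sum(f[m] * g[(k - m) % n] for m in range(n)) for k in range(n)]


def circular_convolution_fourier(f: Sequence[complex], g: Sequence[complex]) -> List[complex]:
    """Circular convolution via pointwise multiplication of Fourier coefficients."""
    product = [a * b for a, b in zip(dft(f), dft(g))]
    return dft(product, inverse=True)


if __name__ == "__main__":
    f = [1.0, 2.0, 3.0, 4.0]
    g = [0.0, 1.0, 0.5, 0.0]
    print(circular_convolution_direct(f, g))
    print([round(v.real, 6) for v in circular_convolution_fourier(f, g)])
```

Both routes give the same sequence up to floating‑point error, which is the property that lets pointwise multiplication of encoded Fourier coefficients stand in for convolution.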
Authors: Shuhan Xia, Xuannan Liu, Xing Cui, Peipei Li
Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3‑Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3‑Tracer consists of two complementary core modules: the Frame‑Audio Feature Aggregation Module (FA‑FAM) and the Segment‑level Multi‑Scale Discrepancy‑Aware Module (SMDAM). FA‑FAM is designed to detect the authenticity of each audio frame. It combines both frame‑level and audio‑level temporal information to detect intra‑frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual‑branch architecture that jointly models frame features and inter‑frame differences across multi‑scale temporal windows, effectively identifying abrupt anomalies that appear at forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state‑of‑the‑art performance.
Authors: Luca A. Lanzendörfer, Florian Grötschla
Abstract: Neural audio codecs have gained recent popularity for their use in generative modeling as they offer high‑fidelity audio reconstruction at low bitrates. While human listening studies remain the gold standard for assessing perceptual quality, they are time‑consuming and impractical. In this work, we examine the reliability of existing objective quality metrics in assessing the performance of recent neural audio codecs. To this end, we conduct a MUSHRA listening test on high‑fidelity speech signals and analyze the correlation between subjective scores and widely used objective metrics. Our results show that, while some metrics align well with human perception, others struggle to capture relevant distortions. Our findings provide practical guidance for selecting appropriate evaluation metrics when using neural audio codecs for speech.
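The kind of analysis described above, relating subjective listening scores to an objective metric, can be sketched with a rank correlation. The abstract does not say which correlation coefficient the authors use; Spearman's rank correlation is assumed here as one common choice, and the function names and score values are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-codec scores: listening-test means vs. an objective metric.
    listening = [82.0, 64.0, 47.0, 90.0, 55.0]
    metric = [4.1, 3.2, 2.9, 4.4, 3.0]
    print(spearman(listening, metric))
```

A rank‑based coefficient only asks whether the metric orders the systems as listeners do, which suits metrics whose scale is not linear in perceived quality.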
Authors: Plein Versace
Abstract: Implicit Neural Representations (INRs) have emerged as a powerful paradigm for representing signals such as images, audio, and 3D scenes. However, existing INR frameworks, including MLPs with Fourier features, SIREN, and multiresolution hash grids, implicitly assume a global and stationary spectral basis. This assumption is fundamentally misaligned with real‑world signals whose frequency characteristics vary significantly across space, exhibiting local high‑frequency textures, smooth regions, and frequency drift phenomena. We propose Neural Spectral Transport Representation (NSTR), the first INR framework that explicitly models a spatially varying local frequency field. NSTR introduces a learnable frequency transport equation, a PDE that governs how local spectral compositions evolve across space. Given a learnable local spectrum field S(x) and a frequency transport network F_θ enforcing \nabla S(x) \approx F_θ(x, S(x)), NSTR reconstructs signals by spatially modulating a compact set of global sinusoidal bases. This formulation enables strong local adaptivity and offers a new level of interpretability via visualizing frequency flows. Experiments on 2D image regression, audio reconstruction, and implicit 3D geometry show that NSTR achieves significantly better accuracy‑parameter trade‑offs than SIREN, Fourier‑feature MLPs, and Instant‑NGP. NSTR requires fewer global frequencies, converges faster, and naturally explains signal structure through spectral transport fields. We believe NSTR opens a new direction in INR research by introducing explicit modeling of space‑varying spectrum.
Authors: Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, David Harwath
Abstract: We introduce VoiceCraft‑X, an autoregressive neural codec language model which unifies multilingual speech editing and zero‑shot Text‑to‑Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft‑X utilizes the Qwen3 large language model for phoneme‑free cross‑lingual text processing and a novel token reordering mechanism with time‑aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high‑quality, natural‑sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft‑X shows robust performance in diverse linguistic settings, even with limited per‑language data, underscoring the power of unified autoregressive approaches for advancing complex, real‑world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft‑x/.
Authors: Yi Yang, Haowen Li, Tianxiang Li, Boyu Cao, Xiaohan Zhang, Liqun Chen, Qi Liu
Abstract: Text‑to‑music generation technology is progressing rapidly, creating new opportunities for musical composition and editing. However, existing music editing methods often fail to preserve the source music's temporal structure, including melody and rhythm, when altering particular attributes like instrument, genre, and mood. To address this challenge, this paper conducts an in‑depth probing analysis on attention maps within AudioLDM 2, a diffusion‑based model commonly used as the backbone for existing music editing methods. We reveal a key finding: cross‑attention maps encompass details regarding distinct musical characteristics, and interventions on these maps frequently result in ineffective modifications. In contrast, self‑attention maps are essential for preserving the temporal structure of the source music during its conversion into the target music. Building upon this understanding, we present Melodia, a training‑free technique that selectively manipulates self‑attention maps in particular layers during the denoising process and leverages an attention repository to store source music information, achieving accurate modification of musical characteristics while preserving the original structure without requiring textual descriptions of the source music. Additionally, we propose two novel metrics to better evaluate music editing methods. Both objective and subjective experiments demonstrate that our approach achieves superior results in terms of textual adherence and structural integrity across various datasets. This research enhances comprehension of internal mechanisms within music generation models and provides improved control for music creation.
Authors: Andong Li, Tong Lei, Rilin Chen, Kai Li, Meng Yu, Xiaodong Li, Dong Yu, Chengshi Zheng
Abstract: This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, by rank analysis, we compare the rank characteristics of the Mel‑spectrum with other common acoustic degradation factors, and cast the vocoder task as a specialized case of audio restoration, where the range‑space spectral (RSS) surrogate of the target spectrum acts as the degraded input. Based on that, we introduce the Schrödinger bridge framework for diffusion modeling, which defines the RSS and target spectrum as dual endpoints of the stochastic generation trajectory. Further, to fully utilize the hierarchical prior of subbands in the time‑frequency (T‑F) domain, we devise a novel subband‑aware convolutional diffusion network as the data predictor, where subbands are divided following an uneven strategy, and a convolutional‑style attention module is employed with large kernels for efficient T‑F contextual modeling. To enable single‑step inference, we propose an omnidirectional distillation loss to facilitate effective information transfer from the teacher model to the student model, and the performance is improved by combining target‑related and bijective consistency losses. Comprehensive experiments are conducted on various benchmarks and out‑of‑distribution datasets. Quantitative and qualitative results show that while enjoying fewer parameters, lower computational cost, and competitive inference speed, the proposed BridgeVoC yields state‑of‑the‑art performance over existing advanced GAN‑, DDPM‑, and flow‑matching‑based baselines with only 4 sampling steps. Consistent superiority is still achieved with single‑step inference.
Authors: Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Hui Zhan, Jie Gao, Jing Peng, Jingdong Chen, Jun Zhou, Kaimeng Ren, Ming Yang, Mingxue Yang, Qiang Xu, Qin Zhao, Ruijie Xiong, Shaoxiong Lin, Xuezhi Wang, Yi Yuan, Yifei Wu, Yongjie Lyu, Zhengyu He, Zhihao Qiu, Zhiqiang Fang, Ziyuan Huang
Abstract: Existing speech models suffer from competing requirements on token representations by understanding and generation tasks. This discrepancy in representation prevents speech language models from performing instruction‑based free‑form editing. To solve this challenge, we introduce a novel framework that unifies speech understanding, generation, and editing. The core of our unified model is a unified continuous speech tokenizer MingTok‑Audio, the first continuous tokenizer to effectively integrate semantic and acoustic features, which makes it suitable for both understanding and generation tasks. Based on this unified continuous audio tokenizer, we developed the speech language model Ming‑UniAudio, which achieved a balance between generation and understanding capabilities. Ming‑UniAudio sets new state‑of‑the‑art (SOTA) records on 8 out of 12 metrics on the ContextASR benchmark. Notably, for Chinese voice cloning, it achieves a highly competitive Seed‑TTS‑WER of 0.95. Leveraging this foundational model, we further trained a dedicated speech editing model Ming‑UniAudio‑Edit, the first speech language model that enables universal, free‑form speech editing guided solely by natural language instructions, handling both semantic and acoustic modifications without timestamp condition. To rigorously assess the editing capability and establish a foundation for future research, we introduce Ming‑Freeform‑Audio‑Edit, the first comprehensive benchmark tailored for instruction‑based free‑form speech editing, featuring diverse scenarios and evaluation dimensions spanning semantic correctness, acoustic quality, and instruction alignment. We open‑sourced the continuous audio tokenizer, the unified foundational model, and the free‑form instruction‑based editing model to facilitate the development of unified audio understanding, generation, and manipulation.
Authors: Ali Boudaghi, Hadi Zare
Abstract: Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task‑specific retraining, thus lacking true zero‑shot capability. Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, a zero‑shot text‑to‑music editing model capable of performing diverse editing tasks on real‑world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real‑world scenarios.
Authors: Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Li Xie, Yuxin Zhang, Xiangyu, Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Shuchang Zhou, Gang Yu
Abstract: We present Step‑Audio‑EditX, the first open‑source LLM‑based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero‑shot text‑to‑speech (TTS) capabilities. Our core innovation lies in leveraging only large‑margin synthetic data, which circumvents the need for embedding‑based priors or auxiliary modules. This large‑margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation‑level disentanglement. Evaluation results demonstrate that Step‑Audio‑EditX surpasses both MiniMax‑2.6‑hd and Doubao‑Seed‑TTS‑2.0 in emotion editing and other fine‑grained control tasks.
Authors: Daniel Jimon, Mircea Vaida, Adriana Stan
Abstract: Audio denoising is critical in signal processing, enhancing intelligibility and fidelity for applications like restoring musical recordings. This paper presents a proof‑of‑concept (PoC) for adapting a state‑of‑the‑art neural audio codec, the Descript Audio Codec (DAC), for music denoising. This work overcomes the limitations of traditional architectures like U‑Nets by training the model on a large‑scale, custom‑synthesized dataset built from diverse sources. Training is guided by a multi‑objective loss function that combines time‑domain, spectral, and signal‑level fidelity metrics. Ultimately, this paper aims to present a PoC for high‑fidelity, generative audio restoration.
Authors: Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer
Abstract: Generative models have made significant progress in synthesizing high‑fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO‑Instruct, a model based on Stable Audio Open capable of editing audio clips using any free‑form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt‑to‑Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in‑the‑wild audio clips and unseen edit instructions. We demonstrate that SAO‑Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
Authors: Qianheng Xu
Abstract: Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi‑stage automatic speech recognition (ASR) and text‑to‑speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end‑to‑end waveform‑to‑waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional‑bidirectional LSTM encoder‑decoder with attention, whereas StutterFormer integrates a dual‑stream Transformer with shared acoustic‑linguistic representations. Both architectures are trained on paired stuttered‑fluent data synthesized from the SEP‑28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero had a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper‑Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end‑to‑end stutter‑to‑fluent speech conversion, offering new opportunities for inclusive human‑computer interaction, speech therapy, and accessibility‑oriented AI systems.
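Word Error Rate, the metric reported above, is conventionally the word‑level edit distance between a reference and a hypothesis transcript divided by the number of reference words. The sketch below assumes that standard definition with whitespace tokenization and no text normalization; the abstract does not describe the authors' exact scoring setup, and the example sentences are invented.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A transcript with stutter-like word repetitions scored against the fluent text.
    print(word_error_rate("i want to go", "i i want want to go"))
```

Repeated words in a disfluent transcript count as insertions, so WER can exceed 1.0 when the hypothesis is much longer than the reference.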
Authors: Junnuo Wang
Abstract: Recent advances in diffusion‑based generative models have enabled high‑quality text‑to‑audio synthesis, but fine‑grained acoustic control remains a significant challenge in open‑source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the Stable Audio Open architecture to address this "control gap" in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time‑varying control signals, loudness, pitch, spectral centroid, and timbre, for precise and interpretable manipulation of acoustic features. The model is efficiently adapted for the nuanced domain of Foley synthesis using Low‑Rank Adaptation (LoRA) on a curated subset of AudioSet, requiring only 0.85% of the original parameters to be trained. Experiments demonstrate that Audio Palette achieves fine‑grained, interpretable control of sound attributes. Crucially, it accomplishes this novel controllability while maintaining high audio quality and strong semantic alignment to text prompts, with performance on standard metrics such as Frechet Audio Distance (FAD) and LAION‑CLAP scores remaining comparable to the original baseline model. We provide a scalable, modular pipeline for audio research, emphasizing sequence‑based conditioning, memory efficiency, and a three‑scale classifier‑free guidance mechanism for nuanced inference‑time control. This work establishes a robust foundation for controllable sound design and performative audio synthesis in open‑source settings, enabling a more artist‑centric workflow in the broader context of music and sound information retrieval.
Authors: Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis
Abstract: We introduce MAVE (Mamba with Cross‑Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text‑conditioned voice editing and high‑fidelity text‑to‑speech (TTS) synthesis, built on a cross‑attentive Mamba backbone. MAVE achieves state‑of‑the‑art performance in speech editing and very competitive results in zero‑shot TTS, while not being explicitly trained on the latter task, outperforming leading autoregressive and diffusion models on diverse, real‑world audio. By integrating Mamba for efficient audio sequence modeling with cross‑attention for precise text‑acoustic alignment, MAVE enables context‑aware voice editing with exceptional naturalness and speaker consistency. In pairwise human evaluations on a random 40‑sample subset of the RealEdit benchmark (400 judgments), 57.2% of listeners rated MAVE‑edited speech as perceptually equal to the original, while 24.8% preferred the original and 18.0% preferred MAVE, demonstrating that in the majority of cases edits are indistinguishable from the source. MAVE compares favorably with VoiceCraft and FluentSpeech both on pairwise comparisons and standalone mean opinion score (MOS) evaluations. For zero‑shot TTS, MAVE exceeds VoiceCraft in both speaker similarity and naturalness, without requiring multiple inference runs or post‑processing. Remarkably, these quality gains come with a significantly lower memory cost and approximately the same latency: MAVE requires ~6x less memory than VoiceCraft during inference on utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch size 1). Our results demonstrate that MAVE establishes a new standard for flexible, high‑fidelity voice editing and synthesis through the synergistic integration of structured state‑space modeling and cross‑modal attention.
Authors: Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang
Abstract: While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text‑to‑audio (T2A) generation, they still lag behind diffusion‑based models by a non‑trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM‑based framework that employs multiple isolated transformers with causal conditioning and anti‑causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM‑based and diffusion‑based T2A systems, achieving state‑of‑the‑art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi‑modal generation frameworks.
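To make the trade‑off above concrete, here is a minimal sketch of generic residual vector quantization. Each layer quantizes what the previous layers left over, so adding layers lowers reconstruction error while adding one more token stream for a language model to predict. This is not Siren's tokenizer; the tiny hand‑written codebooks and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codebook entry closest to x in squared Euclidean distance."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> Tuple[List[int], List[float]]:
    """Quantize x layer by layer; return one index per layer and the final residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct by summing the selected entry from each layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    # Toy codebooks: a coarse first layer and a finer second layer.
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
        [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
    ]
    x = [1.25, 1.0]
    print(rvq_encode(x, codebooks[:1]))  # one layer leaves a residual
    print(rvq_encode(x, codebooks))      # two layers reduce it
```

In the toy example the second layer removes the residual left by the first, at the cost of a second index per frame.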
Authors: Christian Limberg, Fares Schulz, Zhe Zhang, Stefan Weinzierl
Abstract: This paper presents a novel approach to neural instrument sound synthesis using a two‑stage semi‑supervised learning framework capable of generating pitch‑accurate, high‑quality music samples from an expressive timbre latent space. Existing approaches that achieve sufficient quality for music production often rely on high‑dimensional latent representations that are difficult to navigate and provide unintuitive user experiences. We address this limitation through a two‑stage training paradigm: first, we train a pitch‑timbre disentangled 2D representation of audio samples using a Variational Autoencoder; second, we use this representation as conditioning input for a Transformer‑based generative model. The learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the proposed method effectively learns a disentangled timbre space, enabling expressive and controllable audio generation with reliable pitch conditioning. Experimental results show the model's ability to capture subtle variations in timbre while maintaining a high degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application, highlighting its potential as a step towards future music production environments that are both intuitive and creatively empowering: https://pgesam.faresschulz.com
Authors: Jingyi Li, Zhiyuan Zhao, Zhisheng Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Jiahao Wu, Qiuqiang Kong, Yu Li
Abstract: Large Audio Language Models (LALMs) have emerged with strong performance across diverse audio understanding tasks and can be further enhanced by neural audio codecs. Transitioning from multi‑layer residual vector quantizers to a single‑layer quantizer has been shown to facilitate more efficient downstream language model decoding. However, the ability of a single codebook to capture fine‑grained acoustic details remains limited, as the frequency‑variant nature of 1D tokenizers leads to redundancy. To address this issue, we propose MelTok, a two‑dimensional (2D) tokenizer that effectively compresses acoustic details of 44.1 kHz audio into a single codebook. The tokenizer encodes audio into a more compact representation than one‑dimensional tokenizers. Furthermore, to recover audio from mel‑spectrogram tokens, we propose a token‑based vocoder. Both objective and subjective evaluations demonstrate that MelTok achieves quality comparable to multi‑codebook codecs and outperforms existing state‑of‑the‑art neural codecs with a single codebook on high‑fidelity audio reconstruction. By preserving acoustic details, MelTok offers a strong representation for downstream understanding tasks.
Authors: Ilpo Viertola, Vladimir Iashin, Esa Rahtu
Abstract: Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models focus on the entire video and do not provide precise methods for prioritizing a specific object within a scene, generating unnecessary background sounds, or focusing on the wrong objects. To address this gap, we introduce the novel task of video object segmentation‑aware audio generation, which explicitly conditions sound synthesis on object‑level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine‑grained and visually localized control over audio generation. To support this task and further research on segmentation‑aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements over current state‑of‑the‑art methods and sets a new standard for controllable, high‑fidelity Foley synthesis. Code, samples, and Segmented Music Solos are available at https://saganet.notion.site
Authors: Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk
Abstract: With the prevalence of artificial intelligence (AI)‑generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real‑world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open‑source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in‑ and out‑of‑domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.
Authors: Matthieu Cervera, Francesco Paissan, Mirco Ravanelli, Cem Subakan
Abstract: Free‑form, text‑based audio editing remains a persistent challenge, despite progress in inversion‑based neural methods. Current approaches rely on slow inversion procedures, limiting their practicality. We present a virtual‑consistency based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model‑agnostic, requiring no fine‑tuning or architectural changes, and achieves substantial speed‑ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.
Authors: Alexander Wang, Chris Donahue, Dhruv Jain
Abstract: We propose a system to adapt a user's music to their exercise by aligning high‑energy music segments with intense intervals of the workout. Listening to music during exercise can boost motivation and performance. However, the structure of the music may be different from the user's natural phases of rest and work, causing users to rest longer than needed while waiting for a motivational section, or lose motivation mid‑work if the section ends too soon. To address this, our system, called RISE, automatically estimates the intense segments in music and uses component‑based music rearrangement techniques to dynamically extend and shorten different segments of the user's song to fit the ongoing exercise routine. Our system takes as input the rest and work durations to guide adaptation. Currently, this is determined either via a pre‑defined plan or manual input during the workout. We evaluated RISE with 12 participants and compared our system to a non‑adaptive music baseline while exercising in our lab. Participants found that our rearrangements keep the intensity estimation accurate, and many recalled moments when intensity alignment helped them push through their workout.
Authors: Wataru Nakata, Yuki Saito, Yota Ueda, Hiroshi Saruwatari
Abstract: Large‑scale text‑to‑speech (TTS) systems are limited by the scarcity of clean, multilingual recordings. We introduce Sidon, a fast, open‑source speech restoration model that converts noisy in‑the‑wild speech into studio‑quality speech and scales to dozens of languages. Sidon consists of two models: a w2v‑BERT 2.0 feature predictor finetuned to cleanse features from noisy speech, and a vocoder trained to synthesize restored speech from the cleansed features. Sidon achieves restoration performance comparable to Miipher, Google's internal speech restoration model built for cleansing speech synthesis datasets. Sidon is also computationally efficient, running up to 500 times faster than real time on a single GPU. We further show that training a TTS model using a Sidon‑cleansed automatic speech recognition corpus improves the quality of synthetic speech in a zero‑shot setting. Code and model are released to facilitate reproducible dataset cleansing for the research community.
Authors: Kam Man Wu, Zeyue Tian, Liya Ji, Qifeng Chen
Abstract: Video and audio inpainting for mixed audio‑visual content has recently become a crucial task in multimedia editing. However, precisely removing an object and its corresponding audio from a video without affecting the rest of the scene remains a significant challenge. To address this, we propose VAInpaint, a novel pipeline that first utilizes a segmentation model to generate masks and guide a video inpainting model in removing objects. At the same time, an LLM analyzes the scene globally, while a region‑specific model provides localized descriptions. Both the overall and regional descriptions are fed into an LLM, which refines the content and turns it into text queries for our text‑driven audio separation model. Our audio separation model is fine‑tuned on a customized dataset comprising segmented MUSIC instrument images and VGGSound backgrounds to enhance its generalization performance. Experiments show that our method achieves performance comparable to current benchmarks in both audio and video inpainting.
Authors: Ye-Xin Lu, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, Zhen-Hua Ling
Abstract: This paper presents DAIEN‑TTS, a zero‑shot text‑to‑speech (TTS) framework that enables ENvironment‑aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN‑TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5‑TTS, the proposed DAIEN‑TTS first incorporates a pretrained speech‑environment separation (SES) module to disentangle the environmental speech into mel‑spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel‑spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel‑spectrogram, enabling the simultaneous continuation of personalized speech and time‑varying environmental audio. To further enhance controllability during inference, we adopt dual classifier‑free guidance (DCFG) for the speech and environment components and introduce a signal‑to‑noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN‑TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
Authors: Liting Gao, Yi Yuan, Yaru Chen, Yuelan Cheng, Zhenbo Li, Juan Wen, Shubin Zhang, Wenwu Wang
Abstract: Diffusion models have shown remarkable progress in text‑to‑audio generation. However, text‑guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest, thus demanding precise localization and faithful editing according to the text prompt. Existing training‑based and zero‑shot methods that rely on full‑caption or costly optimization often struggle with complex editing or lack practicality. In this work, we propose a novel end‑to‑end efficient rectified flow matching‑based diffusion framework for audio editing, and construct a dataset featuring overlapping multi‑event audio to support training and benchmarking in complex scenarios. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks, while maintaining competitive editing quality across metrics.
Authors: Adhiraj Banerjee, Vipul Arora
Abstract: Text‑guided sound separation enables flexible audio editing, assistive listening, and open‑domain source extraction, but systems such as AudioSep remain too expensive for low‑latency edge or codec‑mediated deployment. Existing neural audio codec separators are efficient, yet largely restricted to fixed stems or closed taxonomies. We introduce CodecSep, a prompt‑driven universal sound separation framework that extracts sources directly in neural audio codec latent space. CodecSep combines a frozen DAC backbone with a lightweight FiLM‑conditioned Transformer masker driven by CLAP text embeddings, enabling open‑vocabulary separation while preserving codec‑native efficiency.
Across dnr‑v2 and five open‑domain benchmarks, CodecSep consistently improves over AudioSep in SI‑SDR, remains competitive in ViSQOL, and achieves clear gains in human MOS‑LQS. Controlled analyses show that fine‑grained prompts outperform coarse labels, and that explicit latent masking is substantially more effective than decoder‑style latent generation in codec space. Qualitative diagnostics show that neural audio codec latents retain source‑dependent structure, which CodecSep exploits mainly through channel‑wise source‑conditioned modulation.
CodecSep also provides a practical code‑stream deployment path. When audio is transmitted as neural audio codec codes, CodecSep maps codes to embeddings, separates directly in codec space, and outputs waveforms or re‑quantized codes, avoiding the decode‑separate‑re‑encode loop. In this regime, CodecSep requires only 1.35 GMACs end‑to‑end: about 54 times less compute than AudioSep in the same pipeline and 25 times lower separator‑only compute, with much lower latency and memory. More broadly, CodecSep offers a blueprint for codec‑native downstream audio processing.
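The FiLM‑conditioned latent masking described in this abstract can be illustrated with a minimal sketch in plain Python. This is not the CodecSep implementation: the function names, the sigmoid soft mask, and the hard‑coded `gamma`/`beta` values are assumptions made for illustration, and the Transformer masker and the CLAP‑to‑FiLM projection are omitted entirely.

```python
import math

def film(features, gamma, beta):
    """Feature-wise linear modulation (FiLM): per-channel scale and shift."""
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def separate_in_latent_space(latents, gamma, beta):
    """Apply a FiLM-modulated soft mask to codec latents (channels x frames).

    Hypothetical helper: in a real system gamma and beta would be predicted
    from a text embedding, and the mask scores would come from a learned masker.
    """
    scores = film(latents, gamma, beta)
    mask = [[sigmoid(s) for s in channel] for channel in scores]
    # Explicit masking: the output is the input latent scaled by a value in (0, 1).
    return [[m * z for m, z in zip(mask_ch, lat_ch)]
            for mask_ch, lat_ch in zip(mask, latents)]

# Toy latents: 2 channels x 3 frames, with placeholder conditioning values.
latents = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
keep_first_only = separate_in_latent_space(
    latents, gamma=[0.0, 0.0], beta=[20.0, -20.0]
)
```

With these placeholder values the first channel passes through almost unchanged and the second is suppressed almost to zero, which is the channel‑wise, source‑conditioned behaviour the abstract attributes to the masker.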
Authors: Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal
Abstract: Editing complex real‑world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes that can delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder‑decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real‑world backgrounds. Evaluation reveals the importance of each part of the edit descriptions: action, class, and timing. Our work demonstrates that "recomposition" is an important and practical application.
Authors: Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, Yulin Wu, Chenhao Hu, Xueyang Lv
Abstract: Speech super‑resolution (SR) reconstructs high‑fidelity wideband speech from low‑resolution inputs, a task that necessitates reconciling global harmonic coherence with local transient sharpness. While diffusion‑based generative models yield impressive fidelity, their practical deployment is often stymied by prohibitive computational demands. Conversely, efficient time‑domain architectures lack the explicit frequency representations essential for capturing long‑range spectral dependencies and ensuring precise harmonic alignment. We introduce STSR, a unified end‑to‑end framework formulated in the MDCT domain to circumvent these limitations. STSR employs a Spectral‑Contextual Attention mechanism that harnesses hierarchical windowing to adaptively aggregate non‑local spectral context, enabling consistent harmonic reconstruction up to 48 kHz. Concurrently, a sparse‑aware regularization strategy is employed to mitigate the suppression of transient components inherent in compressed spectral representations. STSR consistently outperforms state‑of‑the‑art baselines in both perceptual fidelity and zero‑shot generalization, providing a robust, real‑time paradigm for high‑quality speech restoration.
Authors: Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang, Haodi Zhang
Abstract: Multimodal Large Language Models (MLLMs) have been widely applied in speech and music. This tendency has led to a focus on audio tokenization for Large Models (LMs). Unlike semantic‑only text tokens, audio tokens must both capture global semantic content and preserve fine‑grained acoustic details. Moreover, they provide a discrete method for speech and music that can be effectively integrated into MLLMs. However, existing research lacks suitable definitions of semantic tokens and acoustic tokens. In addition, the evaluation of different codecs typically concentrates on specific domains or tasks, such as reconstruction or Automatic Speech Recognition (ASR), which prevents fair and comprehensive comparisons. To address these problems, this paper provides suitable definitions for semantic and acoustic tokens and introduces a systematic evaluation framework. This framework allows for a comprehensive assessment of codecs' capabilities across four dimensions: audio reconstruction metrics, codebook index (ID) stability, decoder‑only transformer perplexity, and performance on downstream probe tasks. Our results support the proposed definitions and show the correlation among reconstruction metrics, codebook ID stability, downstream probe tasks, and perplexity.
Authors: Kevin Putra Santoso, Rizka Wakhidatus Sholikah, Raden Venantius Hari Ginardi
Abstract: High‑quality audio is essential in a wide range of applications, including online communication, virtual assistants, and the multimedia industry. However, degradation caused by noise, compression, and transmission artifacts remains a major challenge. While diffusion models have proven effective for audio restoration, they typically require significant computational resources and struggle to handle longer missing segments. This study introduces WaveLLDM (Wave Lightweight Latent Diffusion Model), an architecture that integrates an efficient neural audio codec with latent diffusion for audio restoration and denoising. Unlike conventional approaches that operate in the time or spectral domain, WaveLLDM processes audio in a compressed latent space, reducing computational complexity while preserving reconstruction quality. Empirical evaluations on the Voicebank+DEMAND test set demonstrate that WaveLLDM achieves accurate spectral reconstruction with low Log‑Spectral Distance (LSD) scores (0.48 to 0.60) and good adaptability to unseen data. However, it still underperforms compared to state‑of‑the‑art methods in terms of perceptual quality and speech clarity, with WB‑PESQ scores ranging from 1.62 to 1.71 and STOI scores between 0.76 and 0.78. These limitations are attributed to suboptimal architectural tuning, the absence of fine‑tuning, and insufficient training duration. Nevertheless, the flexible architecture that combines a neural audio codec and latent diffusion model provides a strong foundation for future development.
Authors: Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu
Abstract: Recent generative models have significantly advanced speech restoration tasks, yet their training objectives often misalign with human perceptual preferences, resulting in suboptimal quality. While post‑training alignment has proven effective in other generative domains like text and image generation, its application to generative speech restoration remains largely under‑explored. This work investigates the challenges of applying preference‑based post‑training to this task, focusing on how to define a robust preference signal and curate high‑quality data to avoid reward hacking. To address these challenges, we propose a multi‑metric preference alignment strategy. We construct a new dataset, GenSR‑Pref, comprising 80K preference pairs, where each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. This principled approach ensures a holistic preference signal. Applying Direct Preference Optimization (DPO) with our dataset, we observe consistent and significant performance gains across three diverse generative paradigms: autoregressive models (AR), masked generative models (MGM), and flow‑matching models (FM) on various restoration benchmarks, in both objective and subjective evaluations. Ablation studies confirm the superiority of our multi‑metric strategy over single‑metric approaches in mitigating reward hacking. Furthermore, we demonstrate that our aligned models can serve as powerful "data annotators", generating high‑quality pseudo‑labels to serve as a supervision signal for traditional discriminative models in data‑scarce scenarios like singing voice restoration. Demo Page: https://gensr‑pref.github.io
Authors: Xiaoxu Zhu, Xiaojie Yu, Guangchao Yao, Yiming Ren, Baoxiang Li
Abstract: Finite Scalar Quantization (FSQ) offers simplified training but suffers from residual magnitude decay in multi‑stage settings, where subsequent stages receive exponentially weaker signals. We propose Robust Residual Finite Scalar Quantization (RFSQ), addressing this fundamental limitation through two novel conditioning strategies: learnable scaling factors and invertible layer normalization. Our experiments across audio and image modalities demonstrate RFSQ's effectiveness and generalizability. In audio reconstruction at 24 bits/frame, RFSQ‑LayerNorm achieves 3.646 DNSMOS, a 3.6% improvement over state‑of‑the‑art RVQ (3.518). On ImageNet, RFSQ achieves 0.102 L1 loss and 0.100 perceptual loss, with LayerNorm providing 9.7% L1 improvement and 17.4% perceptual improvement over unconditioned variants. The LayerNorm strategy consistently outperforms alternatives by maintaining normalized input statistics across stages, effectively preventing exponential magnitude decay that limits naive residual approaches. RFSQ combines FSQ's simplicity with multi‑stage quantization's representational power, establishing a new standard for neural compression across diverse modalities.
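The residual magnitude decay mentioned in this abstract can be reproduced with a toy residual quantizer in plain Python. Everything below is an illustrative assumption rather than the authors' method: the clip‑and‑round quantizer, the five levels, and the per‑stage normalization (a simplified, invertible stand‑in for the LayerNorm conditioning named in the abstract, whose exact form is not given here).

```python
import math
import random

def fsq(x, levels=5):
    """Toy finite scalar quantizer: clip to [-1, 1], round to a uniform grid."""
    half = (levels - 1) / 2
    return [round(max(-1.0, min(1.0, v)) * half) / half for v in x]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def residual_fsq(x, stages=3, normalize=False):
    """Quantize x in several residual stages.

    Returns the RMS of the signal seen by each stage and the final error RMS.
    With normalize=True, each residual is shifted and rescaled before
    quantization, and the stored statistics are used to undo that afterwards.
    """
    residual = list(x)
    seen_rms = []
    for _ in range(stages):
        seen_rms.append(rms(residual))
        if normalize:
            mean = sum(residual) / len(residual)
            std = math.sqrt(sum((v - mean) ** 2 for v in residual) / len(residual))
            scale = 2.0 * std + 1e-8  # assumed factor so most values fit in [-1, 1]
            codes = fsq([(v - mean) / scale for v in residual])
            quantized = [c * scale + mean for c in codes]  # invert the normalization
        else:
            quantized = fsq(residual)
        residual = [r - q for r, q in zip(residual, quantized)]
    return seen_rms, rms(residual)

random.seed(0)
signal = [random.uniform(-1.0, 1.0) for _ in range(1000)]
plain_rms, plain_err = residual_fsq(signal, normalize=False)
norm_rms, norm_err = residual_fsq(signal, normalize=True)
```

In this toy setup the unnormalized second and third stages receive a residual too small for the fixed grid to resolve, so they contribute nothing, while the normalized variant keeps reducing the error at every stage.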
Authors: Yisu Liu, Chenxing Li, Wanqian Zhang, Wenfu Wang, Meng Yu, Ruibo Fu, Zheng Lin, Weiping Wang, Dong Yu
Abstract: Controllable text‑to‑audio generation aims to synthesize audio from textual descriptions while satisfying user‑specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade‑offs among accurate temporal localization, open‑vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph‑guided diffusion transformer framework for open‑vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter‑event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high‑quality and diverse training data, we introduce a quality‑balanced data selection pipeline that combines hierarchical event annotation with multi‑criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, facilitating audio generation through consensus among multiple reward signals. Extensive experiments on AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state‑of‑the‑art performances across a variety of objective and subjective evaluation metrics.
Authors: Soo-Whan Chung, Min-Seok Choi
Abstract: This paper introduces a novel approach to speech restoration by integrating a context‑related conditioning strategy. Specifically, we employ the diffusion‑based generative restoration model, UNIVERSE++, as a backbone to evaluate the effectiveness of contextual representations. We incorporate acoustic context embeddings extracted from the CLAP model, which capture the environmental attributes of input audio. Additionally, we propose an Acoustic Context (ACX) representation that refines CLAP embeddings to better handle various distortion factors and their intensity in speech signals. Unlike content‑based approaches that rely on linguistic and speaker attributes, ACX provides contextual information that enables the restoration model to distinguish and mitigate distortions better. Experimental results indicate that context‑aware conditioning improves both restoration performance and its stability across diverse distortion conditions, reducing variability compared to content‑based methods.
Authors: Isha Pandey, Pranav Gaikwad, Amruta Parulekar, Ganesh Ramakrishnan
Abstract: High‑quality speech generation for low‑resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches choose to omit explicit duration modeling, often at the cost of longer training times, we retain and explore this module to better understand its impact in the linguistically rich and data‑scarce landscape of India. We train a non‑autoregressive Continuous Normalizing Flow (CNF) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero‑shot, speaker‑specific generation. Our comparative analysis on speech‑infilling tasks reveals nuanced trade‑offs: infilling‑based predictors improve intelligibility in some languages, while speaker‑prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components like duration prediction in adapting advanced generative architectures to low‑resource, multilingual settings.
Authors: Jinbo Hu, Yin Cao, Ming Wu, Zhenbo Luo, Jun Yang
Abstract: Spatial audio understanding is essential for accurately perceiving and interpreting acoustic environments. However, existing audio‑language models exhibit limitations in processing spatial audio and perceiving spatial acoustic scenes. To address this gap, we propose the Spatial Audio Language Model (SALM), a novel framework that bridges spatial audio and language through multi‑modal contrastive learning. SALM integrates a text encoder with a dual‑branch audio encoder that decomposes spatial sound into semantic and spatial components via structured audio embeddings. Key features of SALM include seamless alignment between spatial audio and natural language, both separate and joint extraction of spatial and semantic representations, zero‑shot direction classification, and flexible support for spatial audio editing. Experimental results demonstrate that SALM effectively captures and aligns cross‑modal representations, yielding well‑structured audio embeddings. Furthermore, SALM enables advanced editing capabilities, such as modifying directional audio using text‑based embeddings.
Authors: David Fiala, Laurent Pugin, Marnix van Berchum, Martha Thomae, Kévin Roger
Abstract: The Ricercar Lab, the musicological research team at the Center for advanced Studies in the Renaissance at the University of Tours, has decided to make available in open access, thanks to the support of the French digital infrastructure Biblissima, a large corpus of about 3500 XML files of 15th‑c. music. This corpus was produced by the German musicologist Clemens Goldberg who, from 2010 onwards, encoded the musical content of 34 major 15th‑c. music manuscripts and other complementary files, in order to offer on his foundation's website PDF files of complete collections of works by Du Fay, Binchois, Okeghem, Busnoys and most of their major contemporaries, focusing on their secular output. This corpus was encoded in an XML format named CMME (Computerized Mensural Music Editing), specifically conceived for mensural music by Theodor Dumitrescu in the 2000s, together with editorial and publication tools which have not been updated since then. This article focuses on the development of a set of conversion tools for these CMME files to meet more up‑to‑date standards of music encoding, namely MEI. A workshop was organised in September 2024 at the Campus Condorcet in Paris, gathering experts with a wide range of knowledge on mensural music notation, XML formats and programming. A converter was developed directly in the open‑source rendering library Verovio, allowing the conversion from CMME to MEI mensural. A conversion to MEI CMN was implemented afterwards, making it possible to load these files in common engraving software such as MuseScore with minimal loss of information. With the availability of a direct import of CMME‑XML into Verovio, the corpus of existing CMME files gets a new life. Furthermore, since the stand‑alone CMME editor still works fine and no alternative is available yet for native MEI, the converter offers a new pipeline for encoding and editing mensural music.
Authors: Karim Benharrak, Puyuan Peng, Amy Pavel
Abstract: Millions of people listen to podcasts, audio stories, and lectures, but editing speech remains tedious and time‑consuming. Creators remove unnecessary words, cut tangential discussions, and even re‑record speech to make recordings concise and engaging. Prior work automatically summarized speech by removing full sentences (extraction), but rigid extraction limits expressivity. AI tools can summarize then re‑synthesize speech (abstraction), but abstraction strips the speaker's style. We present TalkLess, a system that flexibly combines extraction and abstraction to condense speech while preserving its content and style. To edit speech, TalkLess first generates possible transcript edits, selects edits to maximize compression, coverage, and audio quality, then uses a speech editing model to translate transcript edits into audio edits. TalkLess's interface provides creators control over automated edits by separating low‑level wording edits (via the compression pane) from major content edits (via the outline pane). TalkLess achieves higher coverage and removes more speech errors than a state‑of‑the‑art extractive approach. A comparison study (N=12) showed that TalkLess significantly decreased cognitive load and editing effort in speech editing. We further demonstrate TalkLess's potential in an exploratory study (N=3) where creators edited their own speech.
Authors: Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani
Abstract: Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion‑based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre‑trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative‑based regularization loss that enforces smooth temporal dynamics, and a span‑based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across a range of gap lengths, for gaps of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion model training. Visit our project page for examples and code.
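One way to picture the span‑based absorbing corruption named in this abstract is the sketch below. It is only an assumed reading of the idea: the `MASK` id, the uniform choice of span start, and the function name `corrupt_span` are hypothetical and not taken from the paper.

```python
import random

MASK = -1  # hypothetical id for the absorbing [MASK] state

def corrupt_span(tokens, span_len, rng):
    """Replace one contiguous span of tokens with MASK.

    Positions that are already MASK stay MASK, which is what makes the
    state absorbing: corruption only ever adds masked positions.
    """
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    start = rng.randrange(len(tokens) - span_len + 1)
    corrupted = [MASK if start <= i < start + span_len else t
                 for i, t in enumerate(tokens)]
    return corrupted, start

rng = random.Random(0)
tokens = [11, 12, 13, 14, 15, 16, 17, 18]  # toy audio token ids
step1, start1 = corrupt_span(tokens, 3, rng)
step2, _ = corrupt_span(step1, 2, rng)
```

Compared with masking tokens independently, this kind of corruption removes a whole contiguous region at once, which resembles the gap a model faces at inpainting time.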
Authors: Taewoo Kim, Uijong Lee, Hayoung Park, Choongsang Cho, Nam In Park, Young Han Lee
Abstract: Speech editing systems aim to naturally modify speech content while preserving acoustic consistency and speaker identity. However, previous studies often struggle to adapt to unseen and diverse acoustic conditions, resulting in degraded editing performance in real‑world scenarios. To address this, we propose an instance‑specific test‑time training method for speech editing in the wild. Our approach employs direct supervision from ground‑truth acoustic features in unedited regions and indirect supervision in edited regions via auxiliary losses based on duration constraints and phoneme prediction. This strategy mitigates the bandwidth discontinuity problem in speech editing, ensuring smooth acoustic transitions between unedited and edited regions. Additionally, it enables precise control over speech rate by adapting the model to target durations via mask length adjustment during test‑time training. Experiments on in‑the‑wild benchmark datasets demonstrate that our method outperforms existing speech editing systems in both objective and subjective evaluations.
Authors: Or Tal, Felix Kreuk, Yossi Adi
Abstract: Recent progress in text‑to‑music generation has enabled models to synthesize high‑quality musical segments, full compositions, and even respond to fine‑grained control signals, e.g. chord progressions. State‑of‑the‑art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade‑offs and emergent behaviors that can guide future text‑to‑music generation systems. Specifically, we compare the two arguably most common modeling paradigms: auto‑regressive decoding and conditional flow‑matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text‑to‑music generation. Audio sampled examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
Authors: Kuiyuan Zhang, Wenjie Pei, Rushi Lan, Yifang Guo, Zhongyun Hua
Abstract: Deepfakes are AI‑synthesized multimedia data that may be abused for spreading misinformation. Deepfake generation involves both visual and audio manipulation. To detect audio‑visual deepfakes, previous studies commonly employ two relatively independent sub‑models to learn audio and visual features, respectively, and fuse them subsequently for deepfake detection. However, this may underutilize the inherent correlations between audio and visual features. Moreover, utilizing two isolated feature learning sub‑models can result in redundant neural layers, making the overall model inefficient and impractical for resource‑constrained environments.
In this work, we design a lightweight network for audio‑visual deepfake detection via a single‑stream multi‑modal learning framework. Specifically, we introduce a collaborative audio‑visual learning block to efficiently integrate multi‑modal information while learning the visual and audio features. By iteratively employing this block, our single‑stream network achieves a continuous fusion of multi‑modal features across its layers. Thus, our network efficiently captures visual and audio features without the need for excessive block stacking, resulting in a lightweight network design. Furthermore, we propose a multi‑modal classification module that can boost the dependence of the visual and audio classifiers on modality content. It also improves the overall robustness of the video classifier to mismatches between the audio and visual modalities. We conduct experiments on the DF‑TIMIT, FakeAVCeleb, and DFDC benchmark datasets. Compared to state‑of‑the‑art audio‑visual joint detection methods, our method is considerably more lightweight, with only 0.48M parameters, yet it performs better on both uni‑modal and multi‑modal deepfakes, as well as on unseen types of deepfakes.
Authors: Ben Hayes, Charalampos Saitis, György Fazekas
Abstract: Many audio synthesizers can produce the same signal given different parameter configurations, meaning the inversion from sound to parameters is an inherently ill‑posed problem. We show that this is largely due to intrinsic symmetries of the synthesizer, and focus in particular on permutation invariance. First, we demonstrate on a synthetic task that regressing point estimates under permutation symmetry degrades performance, even when using a permutation‑invariant loss function or symmetry‑breaking heuristics. Then, viewing equivalent solutions as modes of a probability distribution, we show that a conditional generative model substantially improves performance. Further, acknowledging the invariance of the implicit parameter distribution, we find that performance is further improved by using a permutation equivariant continuous normalizing flow. To accommodate intricate symmetries in real synthesizers, we also propose a relaxed equivariance strategy that adaptively discovers relevant symmetries from data. Applying our method to Surge XT, a full‑featured open source synthesizer used in real world audio production, we find our method outperforms regression and generative baselines across audio reconstruction metrics.
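The permutation problem described above can be made concrete with a toy example. The sketch below is not the paper's model; it assumes a hypothetical two-oscillator synthesizer whose components are interchangeable, and the helper names `mse` and `permutation_invariant_mse` are illustrative. It shows that a permutation-invariant loss ignores the ordering of components, and that the average of two equivalent labelings is itself not a valid solution, which is one way to see why point-estimate regression can struggle.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equally shaped lists of parameter tuples."""
    diffs = [(x - y) ** 2 for pa, pb in zip(a, b) for x, y in zip(pa, pb)]
    return sum(diffs) / len(diffs)


def permutation_invariant_mse(pred, target):
    """Smallest MSE over all orderings of the predicted components."""
    return min(mse(list(p), target) for p in permutations(pred))


# Two oscillators, each described by (frequency, amplitude); order is arbitrary.
target = [(220.0, 0.5), (440.0, 1.0)]
swapped = [(440.0, 1.0), (220.0, 0.5)]

print(mse(swapped, target))                        # large: plain MSE penalizes the swap
print(permutation_invariant_mse(swapped, target))  # 0.0

# Averaging the two equivalent labelings gives a point that matches neither.
average = [tuple((x + y) / 2 for x, y in zip(a, b)) for a, b in zip(target, swapped)]
print(permutation_invariant_mse(average, target))  # strictly positive
```

Enumerating all permutations is only practical for a handful of components; it is used here purely to keep the example short.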
Authors: You Zhang, Baotong Tian, Lin Zhang, Zhiyao Duan
Abstract: Neural speech editing enables seamless partial edits to speech utterances, allowing modifications to selected content while preserving the rest of the audio unchanged. This useful technique, however, also poses new risks of deepfakes. To encourage research on detecting such partially edited deepfake speech, we introduce PartialEdit, a deepfake speech dataset curated using advanced neural editing techniques. We explore both detection and localization tasks on PartialEdit. Our experiments reveal that models trained on the existing PartialSpoof dataset fail to detect partially edited speech generated by neural speech editing models. As recent speech editing models almost all involve neural audio codecs, we also provide insights into the artifacts that the model learns to rely on when detecting these deepfakes. Further information about the PartialEdit dataset and audio samples can be found on the project page: https://yzyouzhang.com/PartialEdit/index.html.
Authors: Zixun Guo, Simon Dixon
Abstract: Moonbeam is a transformer‑based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. Moonbeam incorporates music‑domain inductive biases by capturing both absolute and relative musical attributes through the introduction of a novel domain‑knowledge‑inspired tokenization method and Multidimensional Relative Attention (MRA), which captures relative music information without additional trainable parameters. Leveraging the pretrained Moonbeam, we propose 2 finetuning architectures with full anticipatory capabilities, targeting 2 categories of downstream tasks: symbolic music understanding and conditional music generation (including music infilling). Our model outperforms other large‑scale pretrained music models in most cases in terms of accuracy and F1 score across 3 downstream music classification tasks on 4 datasets. Moreover, our finetuned conditional music generation model outperforms a strong transformer baseline with a REMI‑like tokenizer. We open‑source the code, pretrained model, and generated samples on GitHub.
Authors: Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon
Abstract: We propose a speech enhancement system that combines speaker‑agnostic speech restoration with voice conversion (VC) to obtain a studio‑level quality speech signal. While voice conversion models are typically used to change speaker characteristics, they can also serve as a means of speech restoration when the target speaker is the same as the source speaker. However, since VC models are vulnerable to noisy conditions, we have included a generative speech restoration (GSR) model at the front end of our proposed system. The GSR model performs noise suppression and restores speech damage incurred during that process without knowledge about the target speaker. The VC stage then uses guidance from clean speaker embeddings to further restore the output speech. By employing this two‑stage approach, we have achieved speech quality objective metric scores comparable to state‑of‑the‑art (SOTA) methods across multiple datasets.
Authors: Kuan-Yu Chen, Jeng-Lin Li, De-Yan Lu, Jian-Jiun Ding
Abstract: With the fast development of zero‑shot text‑to‑speech technologies, it is possible to generate high‑quality speech signals that are indistinguishable from the real ones. Speech editing, including speech insertion and replacement, appeals to researchers due to its potential applications. However, existing studies only considered clean speech scenarios. In real‑world applications, the existence of environmental noise could significantly degrade the quality of generation. In this study, we propose a noise‑resilient speech editing framework, SeamlessEdit, for noisy speech editing. SeamlessEdit adopts a frequency‑band‑aware noise suppression module and an in‑content refinement strategy. It effectively handles the scenario where the frequency bands of voice and background noise are not separated. The proposed SeamlessEdit framework outperforms state‑of‑the‑art approaches in multiple quantitative and qualitative evaluations.
Authors: Li Zhang
Abstract: While recent advancements in AI music generation have predominantly focused on direct audio synthesis, these systems suffer from inherent rigidity, limiting their utility for professional music producers who require granular, highly malleable creative control. Symbolic music (e.g., MIDI) resolves this constraint by providing editable note‑level parameters, yet the natural progression to instruction‑driven symbolic music editing remains critically under‑explored due to a severe scarcity of paired instruction‑MIDI datasets. In this paper, we bypass this data bottleneck by formalizing zero‑shot symbolic music editing as a structured reasoning task. We introduce a novel text‑based "drumroll" notation that translates musical mechanics into a spatial, syntax‑driven grid, empowering off‑the‑shelf Large Language Models (LLMs) to logically deduce and apply complex edits to drum grooves using only zero‑shot prompting. To rigorously evaluate this paradigm, we propose Not that Groove, a comprehensive benchmark comprising thousands of drum grooves paired with specific, descriptive, and stylistic natural language instructions. Crucially, to overcome the prohibitive cost and subjectivity of human musical evaluation, we introduce a scalable, domain‑informed automated unit‑testing framework that symbolically verifies whether an edited groove satisfies the core constraints of the user's request. Our extensive experiments across eight state‑of‑the‑art LLMs demonstrate the high efficacy of this approach, with the top‑performing model achieving a 68% success rate on our automated unit tests. Furthermore, listening tests confirm that our programmatic unit tests align highly with the subjective judgments of professional musicians, establishing a robust, data‑efficient, and scalable foundation for the future of controllable AI music production.
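To give a feel for how a text grid and a symbolic unit test might fit together, here is a small sketch. The grid syntax below (`x` for a hit, `.` for a rest, 16 steps per bar) is invented for this example and is not the paper's "drumroll" notation; the function names and the instruction being tested are likewise hypothetical.

```python
def parse_grid(text):
    """Parse lines like 'kick : x...x...' into {instrument: [bool, ...]}."""
    grid = {}
    for line in text.strip().splitlines():
        name, steps = line.split(":")
        grid[name.strip()] = [ch == "x" for ch in steps.strip()]
    return grid


def check_add_kick_on_beat_3(original, edited):
    """Check the instruction 'add a kick on beat 3' on a 16th-note grid in 4/4."""
    before, after = parse_grid(original), parse_grid(edited)
    beat_3_step = 8  # steps 0, 4, 8, 12 are beats 1 to 4
    kick_added = after["kick"][beat_3_step]
    # The edit should leave every other instrument exactly as it was.
    others_untouched = all(after[k] == before[k] for k in before if k != "kick")
    return kick_added and others_untouched


original = """
kick  : x...............
snare : ....x.......x...
hihat : x.x.x.x.x.x.x.x.
"""

edited = """
kick  : x.......x.......
snare : ....x.......x...
hihat : x.x.x.x.x.x.x.x.
"""

print(check_add_kick_on_beat_3(original, edited))    # True
print(check_add_kick_on_beat_3(original, original))  # False
```

A check of this kind verifies only the constraint stated in the instruction and does not judge whether the result is musically good, which is why agreement with human listeners still has to be measured separately.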
Authors: Wataru Nakata, Yuma Koizumi, Shigeki Karita, Robin Scheibler, Haruko Ishikawa, Adriana Guevara-Rukoz, Heiga Zen, Michiel Bacchiani
Abstract: Reverberation encodes spatial information regarding the acoustic source environment, yet traditional Speech Restoration (SR) usually completely removes reverberation. We propose ReverbMiipher, an SR model extending the parametric resynthesis framework, designed to denoise speech while preserving and enabling control over reverberation. ReverbMiipher incorporates a dedicated ReverbEncoder to extract a reverb feature vector from noisy input. This feature conditions a vocoder to reconstruct the speech signal, removing noise while retaining the original reverberation characteristics. A stochastic zero‑vector replacement strategy during training ensures the feature specifically encodes reverberation, disentangling it from other speech attributes. This learned representation facilitates reverberation control via techniques such as interpolation between features, replacement with features from other utterances, or sampling from a latent space. Objective and subjective evaluations confirm ReverbMiipher effectively preserves reverberation, removes other artifacts, and outperforms the conventional two‑stage approach of SR followed by convolution with a simulated room impulse response. We further demonstrate its ability to generate novel reverberation effects through feature manipulation.
Authors: Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani
Abstract: Training data cleaning is a new application for generative model‑based speech restoration (SR). This paper introduces Miipher‑2, an SR model designed for million‑hour scale data, for training data cleaning for large‑scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher‑2 utilizes a frozen, pre‑trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning‑free feature extractor. To optimize efficiency and minimize memory, Miipher‑2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi‑lingual, studio‑quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher‑2's superior or comparable performance to conventional SR models in word‑error‑rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher‑2 operates efficiently on consumer‑grade accelerators, achieving a real‑time factor of 0.0078, enabling the processing of a million‑hour speech dataset in approximately three days using only 100 such accelerators.
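The throughput figure can be sanity-checked from the definition of the real-time factor (processing time divided by audio duration). The sketch below assumes perfect linear scaling across accelerators and ignores I/O and scheduling overhead; the function name `processing_days` is hypothetical.

```python
def processing_days(audio_hours: float, rtf: float, num_accelerators: int) -> float:
    """Wall-clock days needed to process `audio_hours` of audio.

    RTF = processing time / audio duration, so a single accelerator needs
    audio_hours * rtf hours of compute. Assumes ideal parallel scaling.
    """
    compute_hours = audio_hours * rtf
    wall_clock_hours = compute_hours / num_accelerators
    return wall_clock_hours / 24.0


days = processing_days(audio_hours=1_000_000, rtf=0.0078, num_accelerators=100)
print(f"{days:.2f} days")  # 3.25 days
```

One million hours at an RTF of 0.0078 is 7,800 accelerator-hours, or 78 hours on each of 100 accelerators, which is consistent with the "approximately three days" stated in the abstract.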
Authors: Da-Hee Yang, Jaeuk Lee, Joon-Hyuk Chang
Abstract: We introduce FLOWER, a novel conditioning method designed for speech restoration that integrates Gaussian guidance into generative frameworks. By transforming clean speech into a predefined prior distribution (e.g., Gaussian distribution) using a normalizing flow network, FLOWER extracts critical information to guide generative models. This guidance is incorporated into each block of the generative network, enabling precise restoration control. Experimental results demonstrate the effectiveness of FLOWER in improving performance across various general speech restoration tasks.
Authors: Minwoo Oh, Minsu Park, Eunil Park
Abstract: Short video platforms like YouTube Shorts and TikTok face significant copyright compliance challenges, as infringers frequently embed arbitrary background music (BGM) to obscure original soundtracks (OST) and evade content originality detection. To tackle this issue, we propose a novel pipeline that integrates Music Source Separation (MSS) and cross‑modal video‑music retrieval (CMVMR). Our approach effectively separates arbitrary BGM from the original OST, enabling the restoration of authentic video audio tracks. To support this work, we introduce two domain‑specific datasets: OASD‑20K for audio separation and OSVAR‑160 for pipeline evaluation. OASD‑20K contains 20,000 audio clips featuring mixed BGM and OST pairs, while OSVAR‑160 is a unique benchmark dataset comprising 1,121 video and mixed‑audio pairs, specifically designed for short video restoration tasks. Experimental results demonstrate that our pipeline not only removes arbitrary BGM with high accuracy but also restores OSTs, ensuring content integrity. This approach provides an ethical and scalable solution to copyright challenges in user‑generated content on short video platforms.
Authors: Mahya Khazaei, Ali Bahrani, George Tzanetakis
Abstract: We introduce a real‑time, human‑in‑the‑loop gesture control framework that can dynamically adapt audio and music based on human movement by analyzing live video input. By creating a responsive connection between visual and auditory stimuli, this system enables dancers and performers to not only respond to music but also influence it through their movements. Designed for live performances, interactive installations, and personal use, it offers an immersive experience where users can shape the music in real time.
The framework integrates computer vision and machine learning techniques to track and interpret motion, allowing users to manipulate audio elements such as tempo, pitch, effects, and playback sequence. With ongoing training, it achieves user‑independent functionality, requiring as few as 50 to 80 samples to label simple gestures. This framework combines gesture training, cue mapping, and audio manipulation to create a dynamic, interactive experience. Gestures are interpreted as input signals, mapped to sound control commands, and used to naturally adjust music elements, showcasing the seamless interplay between human interaction and machine response.
Authors: Xinlei Niu, Kin Wai Cheuk, Jing Zhang, Naoki Murata, Chieh-Hsin Lai, Michele Mancusi, Woosung Choi, Giorgio Fabbro, Wei-Hsiang Liao, Charles Patrick Martin, Yuki Mitsufuji
Abstract: Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero‑shot text‑guided editing methods rely on pretrained diffusion models by involving forward‑backward diffusion processes. However, these methods often struggle to preserve the musical content. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that improve the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse‑grained zero‑shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine‑grained personalized music editing by manipulating a concept token that represents a user‑defined musical style. SteerMusic+ allows for the editing of music into user‑defined musical styles that cannot be achieved by the text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality.
Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei
Abstract: Deepfakes have become a universal and rapidly intensifying concern of generative AI across various media types such as images, audio, and videos. Among these, audio deepfakes have been of particular concern due to the ease of high‑quality voice synthesis and distribution via platforms such as social media and robocalls. Consequently, detecting audio deepfakes plays a critical role in combating the growing misuse of AI‑synthesized speech. However, real‑world scenarios often introduce various audio corruptions, such as noise, modification, and compression, that may significantly impact detection performance. This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions, categorized into noise perturbation, audio modification, and compression. Using both traditional deep learning models and state‑of‑the‑art foundation models, we make four unique observations. First, our findings show that while most models demonstrate strong robustness to noise, they are notably more vulnerable to modifications and compression, especially when neural codecs are applied. Second, speech foundation models generally outperform traditional models across most scenarios, likely due to their self‑supervised learning paradigm and large‑scale pre‑training. Third, our results show that increasing model size improves robustness, albeit with diminishing returns. Fourth, we demonstrate how targeted data augmentation during training can enhance model resilience to unseen perturbations. A case study on political speech deepfakes highlights the effectiveness of foundation models in achieving high accuracy under real‑world conditions. These findings emphasize the importance of developing more robust detection frameworks to ensure reliability in practical deployment settings.
Authors: Yang Chen, Hui Wang, Shiyao Wang, Junyang Chen, Jiabei He, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Abstract: While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly‑specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super‑aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group.
Authors: Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda
Abstract: We propose Serenade, a novel framework for the singing style conversion (SSC) task. Although singer identity conversion has made great strides in the previous years, converting the singing style of a singer has been an unexplored research area. We find three main challenges in SSC: modeling the target style, disentangling source style, and retaining the source melody. To model the target singing style, we use an audio infilling task by predicting a masked segment of the target mel‑spectrogram with a flow‑matching model using the complement of the masked target mel‑spectrogram along with disentangled acoustic features. On the other hand, to disentangle the source singing style, we use a cyclic training approach, where we use synthetic converted samples as source inputs and reconstruct the original source mel‑spectrogram as a target. Finally, to retain the source melody better, we investigate a post‑processing module using a source‑filter‑based vocoder and resynthesize the converted waveforms using the original F0 patterns. Our results showed that the Serenade framework can handle generalized SSC tasks with the best overall similarity score, especially in modeling breathy and mixed singing styles. We also found that resynthesizing with the original F0 patterns alleviated out‑of‑tune singing and improved naturalness, but found a slight tradeoff in similarity due to not changing the F0 patterns into the target style.
Authors: Yidi Jiang, Qian Chen, Shengpeng Ji, Yu Xi, Wen Wang, Chong Zhang, Xianghu Yue, ShiLiang Zhang, Haizhou Li
Abstract: The emergence of audio language models is empowered by neural audio codecs, which establish critical mappings between continuous waveforms and discrete tokens compatible with language model paradigms. The evolutionary trends from multi‑layer residual vector quantizer to single‑layer quantizer are beneficial for language‑autoregressive decoding. However, the capability to handle multi‑domain audio signals through a single codebook remains constrained by inter‑domain distribution discrepancies. In this work, we introduce UniCodec, a unified audio codec with a single codebook to support multi‑domain audio data, including speech, music, and sound. To achieve this, we propose a partitioned domain‑adaptive codebook method and domain Mixture‑of‑Experts strategy to capture the distinct characteristics of each audio domain. Furthermore, to enrich the semantic density of the codec without auxiliary modules, we propose a self‑supervised mask prediction modeling approach. Comprehensive objective and subjective evaluations demonstrate that UniCodec achieves excellent audio reconstruction performance across the three audio domains, outperforming existing unified neural codecs with a single codebook, and even surpasses state‑of‑the‑art domain‑specific codecs on both acoustic and semantic representation capabilities.
Authors: Shreya Shukla, Jose Torres, Akshaj Murhekar, Christina Liu, Abhijit Mishra, Jacek Gwizdka, Shounak Roychowdhury
Abstract: Decoding neural activity into human‑interpretable representations is a key research direction in brain‑computer interfaces (BCIs) and computational neuroscience. Recent progress in machine learning and generative AI has driven growing interest in transforming non‑invasive Electroencephalography (EEG) signals into images, text, and audio. This survey consolidates and analyzes developments across EEG‑to‑image synthesis, EEG‑to‑text generation, and EEG‑to‑audio reconstruction. We conducted a structured literature search across major databases (2017‑2025), extracting key information on datasets, generative architectures (GANs, VAEs, transformers, diffusion models), EEG feature‑encoding techniques, evaluation metrics, and the major challenges shaping current work in this area. Our review finds that EEG‑to‑image models predominantly employ encoder‑decoder architectures built on GANs, VAEs, or diffusion models; EEG‑to‑text approaches increasingly leverage transformer‑based language models for open‑vocabulary decoding; and EEG‑to‑audio methods commonly map EEG signals to mel‑spectrograms that are subsequently rendered into audio using neural vocoders. Despite promising advances, the field remains constrained by small and heterogeneous datasets, limited cross‑subject generalization, and the absence of standardized benchmarks. By consolidating methodological trends and available datasets, this survey provides a foundational reference for advancing EEG‑based generative AI and supporting reproducible research. We further highlight open‑source datasets and baseline implementations to facilitate systematic benchmarking and accelerate progress in EEG‑driven neural decoding.
Authors: Mohd. Farhan Israk Soumik, W. K. M. Mithsara, Abdur R. Shahid, Ahmed Imteaj
Abstract: The rapid proliferation of speech‑enabled technologies, including virtual assistants, video conferencing platforms, and wearable devices, has raised significant privacy concerns, particularly regarding the inference of sensitive emotional information from audio data. Existing privacy‑preserving methods often compromise usability and security, limiting their adoption in practical scenarios. This paper introduces a novel, user‑centric approach that leverages familiar audio editing techniques, specifically pitch and tempo manipulation, to protect emotional privacy without sacrificing usability. By analyzing popular audio editing applications on Android and iOS platforms, we identified these features as both widely available and usable. We rigorously evaluated their effectiveness against a threat model, considering adversarial attacks from diverse sources, including Deep Neural Networks (DNNs), Large Language Models (LLMs), and reversibility testing. Our experiments, conducted on three distinct datasets, demonstrate that pitch and tempo manipulation effectively obfuscates emotional data. Additionally, we explore the design principles for lightweight, on‑device implementation to ensure broad applicability across various devices and platforms.
Authors: Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu
Abstract: Large Language Models (LLMs) demonstrate impressive zero‑shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. However, these advanced capabilities may also pose significant safety problems, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual modality inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of audio‑specific jailbreak on Large Audio‑Language Models (LALMs) remains largely underexplored. To address this gap, we introduce Jailbreak‑AudioBench, which consists of the Toolbox, curated Dataset, and comprehensive Benchmark. The Toolbox supports not only text‑to‑audio conversion but also various editing techniques for injecting audio hidden semantics. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state‑of‑the‑art LALMs and establish the most comprehensive Jailbreak benchmark to date for audio modality. Finally, Jailbreak‑AudioBench establishes a foundation for advancing future research on LALMs safety alignment by enabling the in‑depth exposure of more powerful jailbreak threats, such as query‑based audio editing, and by facilitating the development of effective defense mechanisms.
Authors: Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro
Abstract: Real‑world audio is often degraded by numerous factors. This work presents an audio restoration model tailored for high‑res music at 44.1kHz. Our model, Audio‑to‑Audio Schrödinger Bridges (A2SB), is capable of both bandwidth extension (predicting high‑frequency components) and inpainting (re‑generating missing segments). Critically, A2SB is end‑to‑end, requiring no vocoder to predict waveform outputs, able to restore hour‑long audio inputs, and trained on permissively licensed music data. A2SB is capable of achieving state‑of‑the‑art bandwidth extension and inpainting quality on several out‑of‑distribution music test sets.
Authors: Darius Afchar, Gabriel Meseguer-Brocal, Romain Hennequin
Abstract: In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. In particular, the ability to create credible minute‑long synthetic music in a few seconds on user‑friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and artificial reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of an AI‑music detector, a tool that will help in the regulation of synthetic media. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that getting a good test score is not the end of the story. We expose and discuss several facets that could be problematic with such a deployed detector: robustness to audio manipulation and generalisation to unseen models. This second part acts as a position for future research steps in the field and a caveat to a flourishing market of artificial content checkers.
Authors: Chang Li, Zehua Chen, Fan Bao, Jun Zhu
Abstract: Speech super‑resolution (SR), which generates a waveform at a higher sampling rate from its low‑resolution version, is a long‑standing critical task in speech restoration. Previous works have explored speech SR in different data spaces, but these methods either require additional compression networks or exhibit limited synthesis quality and inference speed. Motivated by recent advances in probabilistic generative models, we present Bridge‑SR, a novel and efficient any‑to‑48kHz SR system in the speech waveform domain. Using tractable Schrödinger Bridge models, we leverage the observed low‑resolution waveform as a prior, which is intrinsically informative for the high‑resolution target. By optimizing a lightweight network to learn the score functions from the prior to the target, we achieve efficient waveform SR through a data‑to‑data generation process that fully exploits the instructive content contained in the low‑resolution observation. Furthermore, we identify the importance of the noise schedule, data scaling, and auxiliary loss functions, which further improve the SR quality of bridge‑based systems. The experiments conducted on the benchmark dataset VCTK demonstrate the efficiency of our system: (1) in terms of sample quality, Bridge‑SR outperforms several strong baseline methods under different SR settings, using a lightweight network backbone (1.7M); (2) in terms of inference speed, our 4‑step synthesis achieves better performance than the 8‑step conditional diffusion counterpart (LSD: 0.911 vs 0.927). Demo at https://bridge‑sr.github.io.
Authors: Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Chao-Han Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee, Szu-Wei Fu
Abstract: Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut‑and‑paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re‑implementing Voicebox training and dataset creation. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut‑and‑paste methods. Despite human difficulty, experimental results demonstrate that self‑supervised‑based detectors can achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.
Authors: Stanislav Kirdey
Abstract: We present VoiceRestore, a novel approach to restoring the quality of speech recordings using flow‑matching Transformers trained in a self‑supervised manner on synthetic data. Our method tackles a wide range of degradations frequently found in both short and long‑form speech recordings, including background noise, reverberation, compression artifacts, and bandwidth limitations, all within a single, unified model. Leveraging conditional flow matching and classifier‑free guidance, the model learns to map degraded speech to high‑quality recordings without requiring paired clean and degraded datasets. We describe the training process, the conditional flow matching framework, and the model's architecture. We also demonstrate the model's generalization to real‑world speech restoration tasks, including both short utterances and extended monologues or dialogues. Qualitative and quantitative evaluations show that our approach provides a flexible and effective solution for enhancing the quality of speech recordings across varying lengths and degradation types.
Authors: Hugo Flores García, Oriol Nieto, Justin Salamon, Bryan Pardo, Prem Seetharaman
Abstract: We present Sketch2Sound, a generative audio model capable of creating high‑quality sounds from a set of interpretable time‑varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound‑shape). Sketch2Sound can be implemented on top of any text‑to‑audio latent diffusion transformer (DiT), and requires only 40k steps of fine‑tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketch‑like sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal imitation while retaining the adherence to an input text prompt and audio quality compared to a text‑only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation. Sound examples are available at https://hugofloresgarcia.art/sketch2sound/.
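The random median filtering mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give window sizes, edge handling, or sampling details, so the names `median_filter` and `random_median_filter`, the `max_window` parameter, and the clamped edges are all assumptions made for illustration rather than the authors' implementation.

```python
import random
from statistics import median


def median_filter(signal, window):
    """Sliding median with an odd window; edges are handled by clamping indices."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(signal)
    out = []
    for i in range(n):
        # Gather the neighbourhood around frame i, repeating the edge values.
        neighbourhood = [signal[min(max(j, 0), n - 1)]
                         for j in range(i - half, i + half + 1)]
        out.append(median(neighbourhood))
    return out


def random_median_filter(signal, max_window=9, rng=random):
    """Hypothetical augmentation: draw a random odd window, then smooth the control curve.

    A window of 1 leaves the curve untouched (full temporal detail);
    larger windows keep only its coarse shape.
    """
    window = rng.randrange(1, max_window + 1, 2)  # 1, 3, 5, ...
    return median_filter(signal, window), window


# Toy loudness curve standing in for a control signal extracted from audio.
loudness = [0.0, 0.1, 0.9, 0.1, 0.2, 0.8, 0.7, 0.75, 0.1, 0.0]
smoothed, used_window = random_median_filter(loudness, rng=random.Random(0))
```

Varying the window per training example is one plausible way to expose a model to controls of different temporal precision, which is the effect the abstract describes.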
Authors: Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, Ying Shan
Abstract: Research on large language models has advanced significantly across text, speech, images, and videos. However, multi‑modal music understanding and generation remain underexplored due to the lack of well‑annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi‑modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu‑LLaMA, a model that leverages pre‑trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks (music understanding, text‑to‑music generation, prompt‑based music editing, and multi‑modal music generation) demonstrates that MuMu‑LLaMA outperforms state‑of‑the‑art models, showing its potential for multi‑modal music applications.
Authors: Yixiao Zhang
Abstract: The field of AI‑assisted music creation has made significant strides, yet existing systems often struggle to meet the demands of iterative and nuanced music production. These challenges include providing sufficient control over the generated content and allowing for flexible, precise edits. This thesis tackles these issues by introducing a series of advancements that progressively build upon each other, enhancing the controllability and editability of text‑to‑music generation models.
First, we introduce Loop Copilot, a system that tries to address the need for iterative refinement in music creation. Loop Copilot leverages a large language model (LLM) to coordinate multiple specialised AI models, enabling users to generate and refine music interactively through a conversational interface. Central to this system is the Global Attribute Table, which records and maintains key musical attributes throughout the iterative process, ensuring that modifications at any stage preserve the overall coherence of the music. While Loop Copilot excels in orchestrating the music creation process, it does not directly address the need for detailed edits to the generated content.
To overcome this limitation, MusicMagus is presented as a further solution for editing AI‑generated music. MusicMagus introduces a zero‑shot text‑to‑music editing approach that allows for the modification of specific musical attributes, such as genre, mood, and instrumentation, without the need for retraining. By manipulating the latent space within pre‑trained diffusion models, MusicMagus ensures that these edits are stylistically coherent and that non‑targeted attributes remain unchanged. This system is particularly effective in maintaining the structural integrity of the music during edits, but it encounters challenges with more complex and real‑world audio scenarios.
...
Authors: Jung-Woo Chang, Ke Sun, David Xia, Xinyu Zhang, Farinaz Koushanfar
Abstract: Vibrometry‑based side channels pose a significant privacy risk, exploiting sensors like mmWave radars, light sensors, and accelerometers to detect vibrations from sound sources or proximate objects, enabling speech eavesdropping. Various defenses have been proposed, but they involve costly hardware solutions with inherent physical limitations. This paper presents EveGuard, a software‑driven defense framework that creates adversarial audio, protecting voice privacy from side channels without compromising human perception. We leverage the distinct sensing capabilities of side channels and traditional microphones, where side channels capture vibrations and microphones record changes in air pressure, resulting in different frequency responses. EveGuard first proposes a perturbation generator model (PGM) that effectively suppresses sensor‑based eavesdropping while maintaining high audio quality. Second, to enable end‑to‑end training of PGM, we introduce a new domain translation task called Eve‑GAN for inferring an eavesdropped signal from a given audio. We further apply few‑shot learning to mitigate the data collection overhead for Eve‑GAN training. Our extensive experiments show that EveGuard achieves a protection rate of more than 97 percent against audio classifiers and significantly hinders eavesdropped audio reconstruction. We further validate the performance of EveGuard across three adaptive attack mechanisms. We have conducted a user study to verify the perceptual quality of our perturbed audio.
Authors: Guanwen Feng, Zhihao Qian, Yunan Li, Siyu Jin, Qiguang Miao, Chi-Man Pun
Abstract: While existing one‑shot talking head generation models have achieved progress in coarse‑grained emotion editing, there is still a lack of fine‑grained emotion editing models with high interpretability. We argue that for an approach to be considered fine‑grained, it needs to provide clear definitions and sufficiently detailed differentiation. We present LES‑Talker, a novel one‑shot talking head generation model with high interpretability, to achieve fine‑grained emotion editing across emotion types, emotion levels, and facial units. We propose a Linear Emotion Space (LES) definition based on Facial Action Units to characterize emotion transformations as vector transformations. We design the Cross‑Dimension Attention Net (CDAN) to deeply mine the correlation between the LES representation and the 3D model representation. Through mining multiple relationships across different feature and structure dimensions, we enable the LES representation to guide the controllable deformation of the 3D model. In order to adapt the multimodal data with deviations to the LES and enhance visual quality, we utilize specialized network design and training strategies. Experiments show that our method provides high visual quality along with multilevel and interpretable fine‑grained emotion editing, outperforming mainstream methods.
Authors: Haoran Sun, Dominique Fourer, Hichem Maaref
Abstract: Dynamic Range Compression (DRC) is a widely used audio effect that adjusts signal dynamics for applications in music production, broadcasting, and speech processing. Inverting DRC is of broad importance for restoring the original dynamics, enabling remixing, and enhancing the overall audio quality. Existing DRC inversion methods either overlook key parameters or rely on precise parameter values, which can be challenging to estimate accurately. To address this limitation, we introduce a hybrid approach that combines model‑based DRC inversion with neural networks to achieve robust DRC parameter estimation and audio restoration simultaneously. Our method uses tailored neural network architectures (classification and regression), which are then integrated into a model‑based inversion framework to reconstruct the original signal. Experimental evaluations on various music and speech datasets confirm the effectiveness and robustness of our approach, outperforming several state‑of‑the‑art techniques.
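To make the idea of model‑based DRC inversion concrete, here is a minimal sketch under strong simplifying assumptions: a memoryless hard‑knee compressor acting on levels in dB, with no attack/release smoothing and no make‑up gain. The abstract does not specify the compressor model, so `compress_db` and `expand_db` are hypothetical helpers; the threshold and ratio are exactly the kind of parameters the abstract says are estimated by the neural networks.

```python
def compress_db(level_db, threshold_db, ratio):
    """Static hard-knee compression: levels above the threshold are scaled down by `ratio`."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio


def expand_db(level_db, threshold_db, ratio):
    """Inverse of compress_db when the same threshold and ratio are known (or estimated)."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) * ratio


# Round trip on a few toy levels (in dB) with an assumed threshold and ratio.
threshold_db, ratio = -20.0, 4.0
original = [-40.0, -20.0, -10.0, 0.0]
compressed = [compress_db(x, threshold_db, ratio) for x in original]
restored = [expand_db(y, threshold_db, ratio) for y in compressed]
```

The round trip is exact only because the curve is static and the parameters are known; with time‑varying gain smoothing or misestimated parameters, inversion becomes harder, which is the difficulty the abstract points to.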
Authors: Amer Essakine, Yanqi Cheng, Chun-Wun Cheng, Lipei Zhang, Zhongying Deng, Lei Zhu, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Abstract: Implicit Neural Representations (INRs) have emerged as a paradigm in knowledge representation, offering exceptional flexibility and performance across a diverse range of applications. INRs leverage multilayer perceptrons (MLPs) to model data as continuous implicit functions, providing critical advantages such as resolution independence, memory efficiency, and generalisation beyond discretised data structures. Their ability to solve complex inverse problems makes them particularly effective for tasks including audio reconstruction, image representation, 3D object reconstruction, and high‑dimensional data synthesis. This survey provides a comprehensive review of state‑of‑the‑art INR methods, introducing a clear taxonomy that categorises them into four key areas: activation functions, position encoding, combined strategies, and network structure optimisation. We rigorously analyse their critical properties, such as full differentiability, smoothness, compactness, and adaptability to varying resolutions while also examining their strengths and limitations in addressing locality biases and capturing fine details. Our experimental comparison offers new insights into the trade‑offs between different approaches, showcasing the capabilities and challenges of the latest INR techniques across various tasks. In addition to identifying areas where current methods excel, we highlight key limitations and potential avenues for improvement, such as developing more expressive activation functions, enhancing positional encoding mechanisms, and improving scalability for complex, high‑dimensional data. This survey serves as a roadmap for researchers, offering practical guidance for future exploration in the field of INRs. We aim to foster new methodologies by outlining promising research directions for INRs and applications.
Authors: Shentong Mo, Yibing Song
Abstract: Visual content and accompanying audio signals naturally form a joint representation to improve audio‑visual (AV) related applications. While studies develop various AV representation learning frameworks, the importance of AV data alignment for achieving high‑quality representation is usually underestimated. We observe that an audio signal may contain background noise interference. Also, non‑synchronization may appear between audio and video streams. Such non‑strict data alignment limits representation quality and degrades application performance. In this paper, we propose to improve AV joint representations from a data‑centric perspective by aligning audio signals to visual data. Our alignment is conducted in an agentic workflow controlled by an LLM‑based assistant named AVAgent. For each input AV data pair, our AVAgent uses a multi‑modal LLM to convert audio and visual data into language descriptions separately (i.e., tool use). Then, AVAgent reasons whether this paired data is aligned well and plans to edit the audio signal if needed (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment data. Moreover, we use a VLM to evaluate how modified audio signals match the visual content and provide feedback to AVAgent (i.e., reflection). The tool use, planning, and reflection steps operate cyclically to become an agentic workflow where audio signals are gradually aligned to visual content. To this end, existing methods can directly leverage the aligned AV data via our agentic workflow to improve AV joint representations. The experimental results comprehensively demonstrate the state‑of‑the‑art performance of the proposed approach against previous baselines in diverse downstream tasks.
Authors: K R Prajwal, Bowen Shi, Matthew Lee, Apoorv Vyas, Andros Tjandra, Mahi Luthra, Baishan Guo, Huiyu Wang, Triantafyllos Afouras, David Kant, Wei-Ning Hsu
Abstract: We introduce MusicFlow, a cascaded text‑to‑music generation model based on flow matching. Based on self‑supervised representations to bridge between text descriptions and music audio, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero‑shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being 2 to 5 times smaller and requiring 5 times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.
Authors: Yang Chen, Yuhang Jia, Shiwan Zhao, Ziyue Jiang, Haoran Li, Jiarong Kang, Yong Qin
Abstract: As text‑based speech editing becomes increasingly prevalent, the demand for unrestricted free‑text editing continues to grow. However, existing speech editing techniques encounter significant challenges, particularly in maintaining intelligibility and acoustic consistency when dealing with out‑of‑domain (OOD) text. In this paper, we introduce DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios through semantic enrichment and acoustic consistency. To improve the intelligibility of the edited speech, we enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. Furthermore, we emphasize that interframe smoothing properties are critical for modeling acoustic consistency, and thus we propose a first‑order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech. Experimental results demonstrate that our model achieves state‑of‑the‑art performance in both in‑domain and OOD text scenarios.
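One plausible reading of a first‑order loss is a penalty on the mismatch between frame‑to‑frame differences of predicted and target acoustic features. The abstract does not give the formula, so the L1 form below, and the names `first_order_diff` and `first_order_loss`, are assumptions for illustration only, not DiffEditor's definition.

```python
def first_order_diff(frames):
    """Frame-to-frame differences of a [time][feature] sequence."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]


def first_order_loss(pred, target):
    """Mean absolute error between the first-order differences of pred and target."""
    if len(pred) != len(target) or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("pred and target must have the same shape")
    total, count = 0.0, 0
    for row_p, row_t in zip(first_order_diff(pred), first_order_diff(target)):
        for dp, dt in zip(row_p, row_t):
            total += abs(dp - dt)
            count += 1
    return total / count if count else 0.0


# Toy single-feature example: a spike in the prediction that the target lacks.
target = [[0.0], [0.0], [0.0]]
spiky = [[0.0], [1.0], [0.0]]
loss = first_order_loss(spiky, target)
```

Because it compares differences rather than values, a loss of this shape ignores constant offsets and responds only to how features change between adjacent frames, which is why it can discourage abrupt jumps at edit boundaries.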
Authors: Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu
Abstract: Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high‑frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide‑band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi‑ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48kHz. We benchmark Hi‑ResLDM against state‑of‑the‑art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high‑frequency‑band details. Hi‑ResLDM not only excels in non‑intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high‑resolution speech restoration.
Authors: Megan Wei, Mateusz Modrzejewski, Aswin Sivaraman, Dorien Herremans
Abstract: Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI‑generated and AI‑augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists' workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.
Authors: Da-Hee Yang, Dail Kim, Joon-Hyuk Chang, Jeonghwan Choi, Han-gil Moon
Abstract: We present a novel general speech restoration model, DBP‑Net (dual‑branch parallel network), designed to effectively handle complex real‑world distortions including noise, reverberation, and bandwidth degradation. Unlike prior approaches that rely on a single processing path or separate models for enhancement and restoration, DBP‑Net introduces a unified architecture with dual parallel branches: a masking‑based branch for distortion suppression and a mapping‑based branch for spectrum reconstruction. A key innovation behind DBP‑Net lies in the parameter sharing between the two branches and a cross‑branch skip fusion, where the output of the masking branch is explicitly fused into the mapping branch. This design enables DBP‑Net to simultaneously leverage complementary learning strategies (suppression and generation) within a lightweight framework. Experimental results show that DBP‑Net significantly outperforms existing baselines in comprehensive speech restoration tasks while maintaining a compact model size. These findings suggest that DBP‑Net offers an effective and scalable solution for unified speech enhancement and restoration in diverse distortion scenarios.
Authors: Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu
Abstract: In this paper, we introduce SSR‑Speech, a neural codec autoregressive model designed for stable, safe, and robust zero‑shot text‑based speech editing and text‑to‑speech synthesis. SSR‑Speech is built on a Transformer decoder and incorporates classifier‑free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame‑level watermarks into the edited regions of the speech so that the edited parts can be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state‑of‑the‑art performance in the RealEdit speech editing task and the LibriTTS text‑to‑speech task, surpassing previous methods. Furthermore, SSR‑Speech excels in multi‑span speech editing and also demonstrates remarkable robustness to background sounds. The source code and demos are released.
Authors: Ondřej Mokrý, Peter Balušík, Pavel Rajmic
Abstract: The paper focuses on inpainting missing parts of an audio signal spectrogram, i.e., estimating the missing time‑frequency coefficients. The autoregression‑based Janssen algorithm, a state‑of‑the‑art method for time‑domain audio inpainting, is adapted for the time‑frequency setting. This novel method, termed Janssen‑TF, is compared with the deep‑prior neural network approach using both objective metrics and a subjective listening test, proving Janssen‑TF to be superior in all the considered measures.
Authors: Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon
Abstract: Noise suppression (NS) algorithms are effective in improving speech quality in many cases. However, aggressive noise suppression can damage the target speech, reducing both speech intelligibility and quality despite removing the noise. This study proposes an explicit speech restoration method using a voice conversion (VC) technique for restoration after noise suppression. We observed that high‑quality speech can be restored through a diffusion‑based voice conversion stage, conditioned on the target speaker embedding and speech content information extracted from the de‑noised speech. This speech restoration can achieve enhancement effects such as bandwidth extension, de‑reverberation, and in‑painting. Our experimental results demonstrate that this two‑stage NS+VC framework outperforms single‑stage enhancement models in terms of output speech quality, as measured by objective metrics, while scoring slightly lower in speech intelligibility. To further improve the intelligibility of the combined system, we propose a content encoder adaptation method for robust content extraction in noisy conditions.