arXiv Papers on Audio Forgery, Manipulation, and Deepfakes

Abstract:
Audio deepfakes are a growing challenge for the general public, as well as for journalists and fact‑checkers. The latter need reliable tools to verify the authenticity of their sources, while at the same time keeping their information private. Commercial deepfake detection solutions rely on cloud‑based processing, which raises privacy concerns. To solve this problem, we propose an on‑device audio deepfake detection model. We show that a truncated self‑supervised backbone with a simple logistic classifier is both very fast and often more accurate than existing solutions. Our solution outperforms the baseline AASIST by 10% and improves inference speed by 40%. We integrate this model into a browser plug‑in, which allows journalists and fact‑checkers to detect deepfakes easily and securely. Code for the plugin is available at https://github.com/OctavianPascu97/Audio‑Deepfakes‑Browser‑Plugin.

Abstract:
Audio deepfakes generated by neural text‑to‑speech and voice‑cloning systems threaten speaker verification and public discourse at scale. The core challenge is cross‑dataset generalization: detectors trained on one synthesis pipeline collapse on unseen forgeries. We argue that this failure arises primarily because the structural artifacts of synthetic speech are multi‑timescale trajectory anomalies, whereas every existing detector aggregates fixed‑window frame statistics, which misaligns the architecture with the signal. We propose FlowFake, a Liquid Time‑Constant (LTC) architecture whose hidden state evolves via a learned ODE, with per‑neuron adaptive time constants simultaneously resolving spectral (10ms) and prosodic (2s) cues. At only 34K parameters, FlowFake achieves formal BIBO stability and O(dt^4) integration error. On a four‑dataset cross‑domain benchmark (ASVspoof2019‑LA, FakeOrReal, InTheWild, MLAAD), FlowFake reaches 75.29% on ASVspoof2019 when trained only on FakeOrReal and 79.97% when trained only on MLAAD. It outperforms RawGAT‑ST and Whisper‑DF on every evaluated pair and matches SSL Wav2vec2 (300x larger) at 0.01% of its parameter count. The source code is available at https://github.com/GhostRider2023/FlowFake
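
To make the idea of per‑neuron time constants concrete, here is a minimal sketch of a single liquid time‑constant neuron integrated with a classical fourth‑order Runge‑Kutta step. It uses the general LTC form from the literature, dx/dt = -(1/tau + f) x + f A, and is not taken from the FlowFake code. The weights w, u, b, the bias level A, the step size, and the choice of Runge‑Kutta (one integrator whose global error is O(dt^4)) are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ltc_rhs(x, inp, tau, w=1.0, u=2.0, b=0.0, A=1.0):
    """dx/dt for one LTC neuron: -(1/tau + f) * x + f * A (hypothetical weights)."""
    f = sigmoid(w * x + u * inp + b)  # input-dependent gate in (0, 1)
    return -(1.0 / tau + f) * x + f * A

def rk4_step(x, inp, tau, dt):
    """One classical Runge-Kutta step (global error O(dt^4))."""
    k1 = ltc_rhs(x, inp, tau)
    k2 = ltc_rhs(x + 0.5 * dt * k1, inp, tau)
    k3 = ltc_rhs(x + 0.5 * dt * k2, inp, tau)
    k4 = ltc_rhs(x + dt * k3, inp, tau)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def simulate(tau, steps, dt=0.001, inp=1.0):
    """Response of one neuron to a constant input, starting from x = 0."""
    x, traj = 0.0, []
    for _ in range(steps):
        x = rk4_step(x, inp, tau, dt)
        traj.append(x)
    return traj

# Two neurons with the time constants named in the abstract: 10 ms and 2 s.
fast = simulate(tau=0.01, steps=200)  # 0.2 s of signal
slow = simulate(tau=2.0, steps=200)
print("fast neuron, last step change:", abs(fast[-1] - fast[-2]))
print("slow neuron, last step change:", abs(slow[-1] - slow[-2]))
```

After 0.2 s the fast neuron has settled while the slow one is still integrating, which is the sense in which different time constants respond to cues at different scales. With the gate f in (0, 1) and A = 1, the state also stays inside [0, 1] in this toy setting.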

Abstract:
Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real‑world scenarios where speech is often mixed with background music or noise. Current state‑of‑the‑art methods rely on semantic features from self‑supervised learning (SSL) models, which often fail when processing non‑speech or mixed‑source audio. In this paper, we first introduce MixFake, a large‑scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic‑centric" limitation, we propose a Multi‑stream Prompt Tuning framework that injects signal‑level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.

Abstract:
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons‑of‑interest (POI) such as public figures. Current detection systems primarily rely on generic, black‑box models that fail to capture speaker‑specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme‑based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro‑utterance analysis to micro‑phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker‑specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data‑efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof‑specific training. Furthermore, we introduce the first large‑scale Chinese POI deepfake dataset to benchmark speaker‑specific detection. Experimental results demonstrate that PVP significantly outperforms state‑of‑the‑art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine‑grained, phoneme‑level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue‑tech/PVP
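
The profiling idea can be sketched as follows. This is a deliberately simplified illustration, not the PVP implementation: each phoneme is modeled by a single one‑dimensional Gaussian (a one‑component stand‑in for the GMMs in the abstract), the feature values are made up, and the function names `fit_profile` and `score` are hypothetical.

```python
import math
from statistics import mean, pvariance

def fit_profile(reference):
    """Fit one Gaussian (mean, variance) per phoneme from bona fide reference values."""
    return {ph: (mean(vals), max(pvariance(vals), 1e-6))
            for ph, vals in reference.items()}

def log_gauss(x, mu, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

def score(profile, observed):
    """Return (utterance score, per-phoneme average log-likelihood)."""
    per = {}
    for ph, x in observed:
        if ph in profile:  # phonemes without a reference model are skipped
            per.setdefault(ph, []).append(log_gauss(x, *profile[ph]))
    per_ph = {ph: mean(vals) for ph, vals in per.items()}
    return mean(list(per_ph.values())), per_ph

# Toy reference speech for one speaker: phoneme -> made-up scalar features.
reference = {"a": [1.0, 1.2, 0.8, 1.1], "s": [5.0, 5.3, 4.9, 5.1]}
profile = fit_profile(reference)

genuine_score, _ = score(profile, [("a", 1.05), ("s", 5.0)])
fake_score, fake_per_ph = score(profile, [("a", 1.0), ("s", 7.0)])
print(genuine_score, fake_score, fake_per_ph)
```

Because the score is kept per phoneme, the lowest‑scoring phoneme points to where the test utterance departs from the speaker's profile, which mirrors the phoneme‑level interpretability the abstract describes.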

Abstract:
The rapid advancement of generative models has enabled highly realistic audio deepfakes, yet current detectors suffer from a critical bias problem, leading to poor generalization across unseen datasets. This paper proposes Artifact‑Focused Self‑Synthesis (AFSS), a method designed to mitigate this bias by generating pseudo‑fake samples from real audio via two mechanisms: self‑conversion and self‑reconstruction. The core insight of AFSS lies in enforcing same‑speaker constraints, ensuring that real and pseudo‑fake samples share identical speaker identity and semantic content. This forces the detector to focus exclusively on generation artifacts rather than irrelevant confounding factors. Furthermore, we introduce a learnable reweighting loss to dynamically emphasize synthetic samples during training. Extensive experiments across 7 datasets demonstrate that AFSS achieves state‑of‑the‑art performance with an average EER of 5.45%, including a significant reduction to 1.23% on WaveFake and 2.70% on In‑the‑Wild, all while eliminating the dependency on pre‑collected fake datasets. Our code is publicly available at https://github.com/NguyenLeHaiSonGit/AFSS.

Abstract:
The rapid progress of generative AI has enabled hyper‑realistic audio‑visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely either on uni‑modal artifacts or audio‑visual discrepancies, failing to jointly leverage both sources of information. Moreover, detectors that rely on generator‑specific artifacts tend to exhibit degraded generalization when confronted with unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio‑visual coherence within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio‑Visual Intrinsic Coherence‑based deepfake detector. HAVIC first learns priors of modality‑specific structural coherence, inter‑modal micro‑ and macro‑coherence by pre‑training on authentic videos. Based on the learned priors, HAVIC further performs holistic adaptive aggregation to dynamically fuse audio‑visual features for deepfake detection. Additionally, we introduce HiFi‑AVDF, a high‑fidelity audio‑visual deepfake dataset featuring both text‑to‑video and image‑to‑video forgeries from state‑of‑the‑art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state‑of‑the‑art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross‑dataset scenario. Our code and dataset are available at https://github.com/tuffy‑studio/HAVIC.

Abstract:
Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker‑disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomials to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source‑speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols and demo website are available at https://github.com/xxuan‑acoustics/RiemannSD‑Net.
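
As an illustration of a Riemannian distance in hyperbolic space, the sketch below computes the geodesic distance in the Poincaré ball model. The abstract does not say which model of hyperbolic space SDML uses, so the Poincaré ball and the function name `poincare_distance` are assumptions; only the distance formula itself is standard.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit (Poincare) ball."""
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))

# The same Euclidean gap (0.05) is a much larger hyperbolic distance near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.05, 0.0))
near_boundary = poincare_distance((0.9, 0.0), (0.95, 0.0))
print(near_origin, near_boundary)
```

Distances grow rapidly toward the boundary of the ball, which is the property that makes hyperbolic embeddings attractive for separating fine‑grained classes.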

Abstract:
In this work, we focus on front‑end design for speech deepfake detectors, the component that determines the discriminative acoustic cues provided to the classifier. Existing approaches are primarily categorized into two types. Hand‑crafted filterbank features are transparent but limited in capturing higher‑level information. SSL features, in turn, lack interpretability and may overlook fine‑grained spectral anomalies. We propose the WST‑X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), which cascades wavelet convolutions with modulus nonlinearities to produce deformation‑stable, multi‑scale features. Experiments on the recent Deepfake‑Eval‑2024 benchmark, together with cross‑dataset evaluations on SpoofCeleb and In‑the‑Wild, show that WST‑X outperforms existing front‑ends by a wide margin. Our analysis reveals that a small averaging scale (J), combined with high‑frequency and directional resolutions (Q, L), is critical for capturing subtle artifacts. This underscores the value of stable and translation‑invariant features for speech deepfake detection. The code is available at https://github.com/xxuan‑acoustics/WST‑X‑Series.

Abstract:
Localizing partial deepfake audio, where only segments of speech are manipulated, remains challenging due to the subtle and scattered nature of these modifications. Existing approaches typically rely on frame‑level predictions to identify spoofed segments, and some recent methods improve performance by concentrating on the transitions between real and fake audio. However, we observe that these models tend to over‑rely on boundary artifacts while neglecting the manipulated content that follows. We argue that effective localization requires understanding entire segments, beyond merely detecting transitions. Thus, we propose Segment‑Aware Learning (SAL), a framework that encourages models to focus on the internal structure of segments. SAL introduces two core techniques: Segment Positional Labeling, which provides fine‑grained frame supervision based on relative position within a segment; and Cross‑Segment Mixing, a data augmentation method that generates diverse segment patterns. Experiments across multiple deepfake localization datasets show that SAL consistently achieves strong performance in both in‑domain and out‑of‑domain settings, with notable gains in non‑boundary regions and reduced reliance on transition artifacts. The code is available at https://github.com/SentryMao/SAL.
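
One way to read "frame supervision based on relative position within a segment" is sketched below. The abstract does not define the labeling scheme, so the encoding used here (a position in [0, 1] within each run of identically labeled frames) and the function name `segment_positions` are assumptions, not the SAL implementation.

```python
def segment_positions(frame_labels):
    """Pair each frame label with its relative position (0..1) inside its segment."""
    out = []
    i, n = 0, len(frame_labels)
    while i < n:
        j = i
        # Extend j to the end of the current run of identical labels.
        while j < n and frame_labels[j] == frame_labels[i]:
            j += 1
        length = j - i
        for k in range(length):
            pos = k / (length - 1) if length > 1 else 0.0
            out.append((frame_labels[i], pos))
        i = j
    return out

# 0 = bona fide frame, 1 = spoofed frame.
labels = [0, 0, 0, 1, 1, 0]
print(segment_positions(labels))
```

Targets of this kind give interior frames their own supervision signal, so a model cannot satisfy the objective by detecting boundaries alone.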

Abstract:
The rapid advancement of AI‑generated multimodal video‑audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes, a limitation that fails to address the expanding landscape of general multimodal AI‑generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video‑Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI‑generated multimodal video‑audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video‑audio forgery patterns; (2) high perceptual quality achieved through diverse state‑of‑the‑art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video‑audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.

Abstract:
Self‑supervised representations excel at many vision and speech tasks, but their potential for audio‑visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross‑modal complementarity. We find that most self‑supervised features capture deepfake‑relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts (such as the leading silence). Among the investigated features, audio‑informed representations generalize best and achieve state‑of‑the‑art results. However, generalization to realistic in‑the‑wild data remains challenging. Our analysis indicates this gap stems from intrinsic dataset difficulty rather than from features latching onto superficial patterns. Project webpage: https://bit‑ml.github.io/ssr‑dfd.

Abstract:
Audio deepfakes pose a growing threat, already exploited in fraud and misinformation. A key challenge is ensuring detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong benchmark results, current systems struggle to generalize to new conditions, limiting real‑world reliability. To address this, we introduce TWINSHIFT, a benchmark explicitly designed to evaluate detection robustness under strictly unseen conditions. Our benchmark is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for a rigorous assessment of how well detectors generalize when both the generative model and the speaker identity change. Through extensive experiments, we show that TWINSHIFT reveals important robustness gaps, uncovers overlooked limitations, and provides principled guidance for developing audio deepfake detection (ADD) systems. The TWINSHIFT benchmark can be accessed at https://github.com/intheMeantime/TWINSHIFT.

Abstract:
Modern front‑end design for speech deepfake detection relies on full fine‑tuning of large pre‑trained models like XLSR. However, this approach is not parameter‑efficient and may lead to suboptimal generalization to realistic, in‑the‑wild data types. To address these limitations, we introduce a new family of parameter‑efficient front‑ends that fuse prompt‑tuning with classical signal processing transforms. These include FourierPT‑XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT‑XLSR and Partial‑WSPT‑XLSR. We further propose WaveSP‑Net, a novel architecture combining a Partial‑WSPT‑XLSR front‑end and a bidirectional Mamba‑based back‑end. This design injects multi‑resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP‑Net outperforms several state‑of‑the‑art models on two new and challenging benchmarks, Deepfake‑Eval‑2024 and SpoofCeleb, with few trainable parameters and notable performance gains. The code and models are available at https://github.com/xxuan‑acoustics/WaveSP‑Net.

Abstract:
Component‑level audio Spoofing (Comp‑Spoof) targets a new form of audio manipulation where only specific components of a signal, such as speech or environmental sound, are forged or substituted while other components remain genuine. Existing anti‑spoofing datasets and methods treat an utterance or a segment as entirely bona fide or entirely spoofed, and thus cannot accurately detect component‑level spoofing. To address this, we construct a new dataset, CompSpoof, covering multiple combinations of bona fide and spoofed speech and environmental sound. We further propose a separation‑enhanced joint learning framework that separates the audio components and applies anti‑spoofing models to each one. Joint learning is employed, preserving information relevant for detection. Extensive experiments demonstrate that our method outperforms the baseline, highlighting the necessity of separating components and the importance of detecting spoofing for each component separately. Datasets and code are available at: https://github.com/XuepingZhang/CompSpoof.

Abstract:
Advances in speech synthesis intensify security threats, motivating real‑time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self‑Attention in detecting synthetic speech. Our solution, Fake‑Mamba, integrates an XLSR front‑end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN‑BiMamba. Leveraging XLSR's rich linguistic representations, PN‑BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In‑The‑Wild benchmarks, Fake‑Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR‑Conformer and XLSR‑Mamba. The framework maintains real‑time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake‑Mamba.

Abstract:
Recent progress in generative AI has made it increasingly easy to create natural‑sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono‑ and cross‑lingual scenarios. We comparatively investigate DSP‑ and SSL‑based modeling; examine how SSL representations fine‑tuned on different languages impact cross‑lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at https://github.com/xuanxixi/Multilingual‑Source‑Tracing.

Abstract:
As speech generation technology advances, the risk of misuse through deepfake audio has become a pressing concern, which underscores the critical need for robust detection systems. However, many existing speech deepfake datasets are limited in scale and diversity, making it challenging to train models that can generalize well to unseen deepfakes. To address these gaps, we introduce SpeechFake, a large‑scale dataset designed specifically for speech deepfake detection. SpeechFake includes over 3 million deepfake samples, totaling more than 3,000 hours of audio, generated using 40 different speech synthesis tools. The dataset encompasses a wide range of generation techniques, including text‑to‑speech, voice conversion, and neural vocoder, incorporating the latest cutting‑edge methods. It also provides multilingual support, spanning 46 languages. In this paper, we offer a detailed overview of the dataset's creation, composition, and statistics. We also present baseline results by training detection models on SpeechFake, demonstrating strong performance on both its own test sets and various unseen test sets. Additionally, we conduct experiments to rigorously explore how generation methods, language diversity, and speaker variation affect detection performance. We believe SpeechFake will be a valuable resource for advancing speech deepfake detection and developing more robust models for evolving generation techniques.

Abstract:
The rapid surge of text‑to‑speech and face‑voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in generation methods and in the perturbation strategies common to online videos. To this end, we propose AV‑Deepfake1M++, an extension of AV‑Deepfake1M with 2 million video clips, diversified manipulation strategies, and audio‑visual perturbations. This paper describes the data generation strategies and benchmarks AV‑Deepfake1M++ using state‑of‑the‑art methods. We believe that this dataset will play a pivotal role in facilitating research in the deepfake domain. Based on this dataset, we host the 2025 1M‑Deepfakes Detection Challenge. The challenge details, dataset and evaluation scripts are available online under a research‑only license at https://deepfakes1m.github.io/2025.

Abstract:
Advances in voice conversion and text‑to‑speech synthesis have made automatic speaker verification (ASV) systems more susceptible to spoofing attacks. This work explores modest refinements to the AASIST anti‑spoofing architecture. It incorporates a frozen Wav2Vec 2.0 encoder to retain self‑supervised speech representations in limited‑data settings, substitutes the original graph attention block with a standardized multi‑head attention module using heterogeneous query projections, and replaces heuristic frame‑segment fusion with a trainable, context‑aware integration layer. When evaluated on the ASVspoof 5 corpus, the proposed system reaches a 7.6% equal error rate (EER), improving on a re‑implemented AASIST baseline under the same training conditions. Ablation experiments suggest that each architectural change contributes to the overall performance, indicating that targeted adjustments to established models may help strengthen speech deepfake detection in practical scenarios. The code is publicly available at https://github.com/KORALLLL/AASIST_SCALING.

Abstract:
Generative models of music audio are typically used to generate output based solely on a text prompt or melody. Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pretrained diffusion model. In this work, we explore its application in the audio domain as a tool for data augmentation or content manipulation. Specifically, implementing Boomerang sampling for Stable Audio Open, we augment training data for a state‑of‑the‑art beat tracker, and attempt to replace musical instruments in recordings. Our results show that the rhythmic structure of existing examples is mostly preserved, that it improves performance of the beat tracker, but only in scenarios of limited training data, and that it can accomplish text‑based instrument replacement on monophonic inputs. We publish our implementation to invite experiments on data augmentation in other tasks and explore further applications.

Abstract:
Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code‑switched speech, where multiple languages are mixed within the same discourse. Code‑switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large‑scale Arabic‑English audio‑visual deepfake dataset featuring intra‑utterance code‑switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake videos. Our dataset is generated using a novel pipeline integrating four Text‑To‑Speech and two lip‑sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state‑of‑the‑art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research. The dataset can be accessed at https://huggingface.co/datasets/kartik060702/ArEnAV‑Full.

Abstract:
Speech synthesis technology has brought great convenience, but the widespread use of realistic deepfake audio has created new hazards. Malicious adversaries may collect victims' speech without authorization and clone a similar voice for illegal exploitation (e.g., telecom fraud). However, existing defense methods cannot effectively prevent deepfake exploitation and are vulnerable to robust training techniques. Therefore, a more effective and robust data protection method is urgently needed. In response, we propose a defensive framework, SafeSpeech, which protects users' audio before uploading by embedding imperceptible perturbations in the original speech to prevent high‑quality speech synthesis. In SafeSpeech, we devise a robust and universal proactive protection technique, Speech PErturbative Concealment (SPEC), that leverages a surrogate model to generate universally applicable perturbations for generative synthetic models. Moreover, we optimize the human perception of the embedded perturbation in both the time and frequency domains. To evaluate our method comprehensively, we conduct extensive experiments across advanced models and datasets, both subjectively and objectively. Our experimental results demonstrate that SafeSpeech achieves state‑of‑the‑art (SOTA) voice protection effectiveness and transferability and is highly robust against advanced adaptive adversaries. Moreover, SafeSpeech has real‑time capability in real‑world tests. The source code is available at https://github.com/wxzyd123/SafeSpeech.

Abstract:
Deepfake technology has rapidly advanced and poses significant threats to information integrity and trust in online multimedia. While significant progress has been made in detecting deepfakes, the simultaneous manipulation of audio and visual modalities, sometimes in small segments or in subtle ways, presents highly challenging detection scenarios. To address these challenges, we present DiMoDif, an audio‑visual deepfake detection framework that leverages the inter‑modality differences in machine perception of speech, based on the assumption that in real samples, in contrast to deepfakes, visual and audio signals coincide in terms of information. DiMoDif leverages features from deep networks that specialize in visual and audio speech recognition to spot frame‑level cross‑modal incongruities, and in that way to temporally localize the deepfake forgery. To this end, we devise a hierarchical cross‑modal fusion network, integrating adaptive temporal alignment modules and a learned discrepancy mapping layer to explicitly model the subtle differences between visual and audio representations. Then, the detection model is optimized through a composite loss function accounting for frame‑level detections and fake‑interval localization. DiMoDif outperforms the state‑of‑the‑art on the Deepfake Detection task by 30.5 AUC on the highly challenging AV‑Deepfake1M, while it performs exceptionally on FakeAVCeleb and LAV‑DF. On the Temporal Forgery Localization task, it outperforms the state‑of‑the‑art by 47.88 AP@0.75 on AV‑Deepfake1M, and performs on par on LAV‑DF. Code available at https://github.com/mever‑team/dimodif.

Abstract:
This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing‑robust Automatic Speaker Verification (SASV), utilizing source data from real‑world conditions and spoofing attacks generated by Text‑To‑Speech (TTS) systems also trained on the same real‑world data. Training robust recognition systems requires speech data recorded in varied acoustic environments with different levels of noise. However, current datasets typically include clean, high‑quality recordings (bona fide data) due to the requirements for TTS training; studio‑quality or well‑recorded read speech is typically necessary to train TTS models. Current SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline we developed that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real‑world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well‑controlled experimental protocols. We present the baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb.

Abstract:
Large speech foundation models have shown strong potential for speech deepfake detection, but direct fine‑tuning is limited by a mismatch between self‑supervised pre‑training objectives and spoof‑specific artifacts. To address this, we propose a mix‑frame post‑training strategy that creates localized spoof‑oriented perturbations and uses frame‑level supervision to encourage the SSL model to learn local inconsistencies that are critical for robust spoof detection. On ASVspoof5, we achieve a state‑of‑the‑art EER of 4.50% for a single model without data augmentation. On ASVspoof2021 LA/DF, our method further achieves an absolute EER gap of only 0.16% between LA and DF, indicating strong and balanced robustness across distinct distortion conditions. These results show that supervised post‑training provides an effective and practical way to adapt speech foundation models for robust deepfake detection.
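
A rough sketch of how frame mixing could produce frame‑level targets is given below. The abstract does not specify how frames are mixed, so everything here is an assumption made for illustration: frames are stand‑in values, a single contiguous span is swapped, and the function name `mix_frames` is hypothetical.

```python
import random

def mix_frames(bona_frames, spoof_frames, span, rng):
    """Swap one random span of bona fide frames for spoof frames; return frames and labels."""
    n = min(len(bona_frames), len(spoof_frames))
    if not 0 < span <= n:
        raise ValueError("span must be between 1 and the shorter sequence length")
    start = rng.randrange(0, n - span + 1)
    mixed = list(bona_frames[:n])
    labels = [0] * n  # 0 = bona fide frame, 1 = spoof frame
    mixed[start:start + span] = spoof_frames[start:start + span]
    for i in range(start, start + span):
        labels[i] = 1
    return mixed, labels

rng = random.Random(0)
bona = ["b%d" % i for i in range(10)]   # placeholder frames
spoof = ["s%d" % i for i in range(10)]
mixed, labels = mix_frames(bona, spoof, span=3, rng=rng)
print(mixed)
print(labels)
```

The per‑frame labels are what make frame‑level supervision possible: the model is asked where the sequence becomes inconsistent, not only whether the utterance as a whole is spoofed.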

Abstract:
In this study, we introduce the Elderly CodecFake Detection (ECFD) task and release the Elderly‑CodecFake (ECF) dataset in English and Chinese. We show that state‑of‑the‑art CodecFake (CF) detectors trained on previous benchmark CF datasets generalize poorly to elderly speech, revealing a critical vulnerability. We further hypothesize and demonstrate that multimodal foundation models (FMs) such as LanguageBind (LB) and ImageBind (IB) are more effective for ECFD due to their exposure to elderly content during cross‑modal pretraining. Motivated by prior evidence that fusion of FMs enhances downstream performance, we explore fusion of FMs for ECFD. To this end, we propose BONSAI, a novel framework that employs Jensen‑Shannon Divergence as the fusion mechanism. BONSAI with the fusion of LB and IB achieves an average EER (%) of 1.66 and outperforms individual FMs as well as competitive SOTA baselines, establishing a new benchmark for the ECFD task.
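
For readers unfamiliar with the quantity, the sketch below computes the Jensen‑Shannon divergence between two discrete distributions. It shows only the standard definition; how BONSAI applies it to fuse foundation‑model representations is not described in the abstract and is not modeled here. The example distributions are made up.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print(jsd(p, q))
```

Unlike the KL divergence, this quantity is symmetric and bounded above by ln 2 (in nats), so it remains well defined even when the two distributions have disjoint support.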

Abstract:
Speech deepfake countermeasures (CMs) are compared almost exclusively by equal error rate (EER), a metric computed at an oracle threshold chosen on the labeled test set. Deployed CMs enjoy no such oracle: a threshold must be fixed in advance and applied to unlabeled target data. We audit this gap with a frozen state‑of‑the‑art SSL‑AASIST detector trained on ASVspoof 2019 LA. While its in‑domain EER is 0.21%, transferring its LA‑calibrated threshold to the In‑the‑Wild corpus yields a half total error rate (HTER) of 39.5%, with 78.7% of bona fide speech rejected, even though the In‑the‑Wild EER (11.2%) appears moderate. We then test whether popular unlabeled test‑time corrections close this gap, and first prove a simple proposition: any strictly increasing score transform, including z‑norm, temperature/shift calibration, and embedding mean alignment under a frozen linear head, cannot change EER. An audit of seven corrections on In‑the‑Wild and ASVspoof 2021 DF confirms the proposition empirically and exposes two further failure modes: AS‑norm with an unlabeled target cohort collapses (EER 11.2% to 60.2%), and pseudo‑label calibration that reduces HTER by 38% relative on In‑the‑Wild degenerates to 50% HTER on DF21, whose spoof prior is 96%. No audited correction reduces EER by more than 1% relative. We recommend reporting HTER at a transferred threshold alongside EER.
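
The gap between EER and a transferred threshold, and the invariance of EER under a strictly increasing score transform, can be reproduced on toy data. The scores below are invented for illustration and are unrelated to the paper's experiments; the threshold sweep is a simple approximation of EER, and the names `error_rates`, `eer`, and `hter` are hypothetical helpers.

```python
def error_rates(bona, spoof, thr):
    """Return (FRR, FAR) when scores >= thr are accepted as bona fide."""
    frr = sum(s < thr for s in bona) / len(bona)      # bona fide rejected
    far = sum(s >= thr for s in spoof) / len(spoof)   # spoof accepted
    return frr, far

def eer(bona, spoof):
    """Approximate EER: sweep observed scores, take the point where FRR and FAR are closest."""
    best = None
    for thr in sorted(set(bona) | set(spoof)):
        frr, far = error_rates(bona, spoof, thr)
        cand = (abs(frr - far), (frr + far) / 2.0)
        if best is None or cand < best:
            best = cand
    return best[1]

def hter(bona, spoof, thr):
    """Half total error rate at a fixed threshold chosen in advance."""
    frr, far = error_rates(bona, spoof, thr)
    return (frr + far) / 2.0

# Source domain: well separated, threshold 1.0 is calibrated here.
bona_src, spoof_src = [2.0, 2.5, 3.0, 3.5], [-3.0, -2.0, -1.0, 0.0]
# Target domain: bona fide scores have drifted downward.
bona_tgt, spoof_tgt = [-0.5, 0.2, 0.8, 1.5], [-3.0, -2.0, -1.0, 0.5]

def calibrate(s):
    return s / 2.0 + 0.7  # strictly increasing temperature/shift map (made-up constants)

print("target EER:", eer(bona_tgt, spoof_tgt))
print("target EER after calibration:",
      eer([calibrate(s) for s in bona_tgt], [calibrate(s) for s in spoof_tgt]))
print("target HTER at source threshold:", hter(bona_tgt, spoof_tgt, 1.0))
```

Because a strictly increasing map preserves the ordering of scores, every threshold on the transformed scores corresponds to a threshold on the original ones with the same FRR and FAR, so the EER cannot move. The HTER at the transferred threshold, by contrast, exposes the drift that the EER hides.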

Abstract:
Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision‑making. Existing explanation approaches mainly fall into two categories. Traditional explainable AI (XAI), such as gradient‑based attribution, produces low‑level attribution signals that are tightly coupled with model decisions but harder for humans to understand than natural language explanations. Meanwhile, large language model (LLM)‑based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task‑specific supervision, stemming from limited grounded explanation datasets for SDD. We therefore propose a training‑free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45%, verified through human evaluation and faithfulness checks.

Abstract:
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phoneme‑guided cross‑attention framework that transforms detection into an interpretable, phonetically grounded process. We factorize the spoofing posterior P(\text{spoofed} \mid X, W), conditioned on the acoustic representation X and the phonetic posteriorgram W. The resulting factorization can be written as P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i), where M denotes the number of phonetic classes, P(\text{spoofed} \mid X, Z = z_i) is the spoofing probability for the i‑th phonetic class z_i conditioned on X, and each w_i is the prevalence of phonetic class z_i in the utterance. Our transformer‑based architecture instantiates this through a cross‑attention block in which phonetic queries selectively probe information in acoustic keys and values, with softmax‑normalized pooling supplying explicit phone‑presence weights. Unlike prior approaches that rely heavily on post‑hoc explainability methods, our framework offers phonetic‑explainability‑by‑design. We evaluate the framework on an LJSpeech‑derived corpus, ASVspoof 2019 LA, and ASVspoof 5 Track 1. Per‑phone importance rankings reveal that discriminative power concentrates on articulatory categories that generative models struggle to reproduce faithfully. Stops, fricatives, affricates, nasals, and silence‑boundary closures rank most discriminative, while periodic vowels and semivowels rank lower. Beyond competitive performance, our model provides structural interpretability, yielding an inspectable per‑articulatory category breakdown of the final verdict.
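
The factorization above is a prevalence-weighted sum of per-class spoofing probabilities, which a small sketch can make concrete. The class names, weights, and probabilities below are hypothetical illustration values; in the paper these quantities come from the cross-attention block, which is not reproduced here.

```python
def spoof_posterior(phone_weights, per_phone_prob):
    """P(spoofed | X, W) = sum_i w_i * P(spoofed | X, Z = z_i)."""
    total = sum(phone_weights.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("phone-presence weights must sum to 1")
    return sum(w * per_phone_prob[z] for z, w in phone_weights.items())


def contributions(phone_weights, per_phone_prob):
    """Per-class terms of the sum, largest first, for inspection."""
    terms = {z: w * per_phone_prob[z] for z, w in phone_weights.items()}
    return sorted(terms.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical utterance: w_i is the prevalence of each phonetic class.
weights = {"stop": 0.2, "fricative": 0.3, "vowel": 0.5}
# Hypothetical per-class spoofing probabilities P(spoofed | X, Z = z_i).
probs = {"stop": 0.9, "fricative": 0.8, "vowel": 0.2}

posterior = spoof_posterior(weights, probs)
ranking = contributions(weights, probs)
print(posterior, ranking)
```

Because the verdict is a plain sum of per-class terms, each term can be read directly as that class's share of the final score.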

Abstract:
With rapid advances in audio‑visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio‑visual deepfake detection typically rely on cross‑modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm‑aware generative models to fill the gap in singing benchmarks. To cope with cross‑scenario domain shifts, we propose a Text‑guided Audio‑Visual Forgery Detection (T‑AVFD) framework that generalizes across both talking and singing scenarios. T‑AVFD comprises a facial authenticity pattern learner and a multi‑modal differential weight learning module. The pattern learner aligns facial features with multi‑granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio‑visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.

Abstract:
Audio deepfakes have recently improved rapidly, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text‑to‑speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3% to 65.9%), while those from traditional seq2seq and flow‑matching models remain easier to spot (75.4% to 76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.

Abstract:
Audio‑visual deepfake localization demands interval‑level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single‑sided or asynchronous forgeries propagates cross‑modal noise, degrading high‑precision localization. We present IaMSB, an inconsistency‑aware multimodal Schrödinger Bridge (SB) that jointly estimates cross‑modal consistency and performs interval‑level localization. Unlike diffusion models, SB minimizes path‑distribution discrepancy and yields consistency scores without explicit noise injection or denoising. Built on SB, IaMSB unifies consistency estimation, cross‑modal information selection, and bridge‑step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross‑modal consistency; these statistics select cross‑modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step‑tuned fusion and outputs refined, time‑aligned intervals. IaMSB anticipates single‑sided and asynchronous forgeries and, using bottlenecked cross‑modal interaction with step allocation, suppresses noise transfer and avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict‑IoU boundary precision, raising AP@0.95 by 3% to 10%, and yields improved high‑precision localization, particularly for single‑sided forgeries.

Abstract:
Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame‑level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token‑space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non‑temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low‑bitrate speech coding settings while enabling simple token‑space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task‑specific editing models.

Abstract:
The proliferation of deepfake audio challenges voice‑based authentication systems; passive forensic detectors are sensitive to evolving generative models and to real‑world channel distortions. We propose Asymmetric Phase Coding (APC), a training‑free cryptographic signing layer for audio, designed as a compact and auditable provenance primitive that can stand alone or be stacked with learned watermarks. APC combines Ed25519 digital signatures (EdDSA, FIPS 186‑5; 64‑byte signatures) with Reed‑Solomon error correction, pseudo‑random STFT phase‑bin selection, and a redundant quantization‑index‑modulation (QIM) code on log‑magnitude differences of adjacent bin pairs, yielding a compact, non‑repudiable, blind‑extractable watermark. We evaluate APC on 1,000 LibriSpeech test‑clean clips (10 s each, 44.1 kHz) under eight attack configurations ‑‑ identity, 10% end‑cropping, 20% end‑cropping, 8 kHz low‑pass, 16 kHz round‑trip resampling, FLAC re‑encoding, MP3 at 128 kbps, and OGG‑Vorbis at 128 kbps ‑‑ and achieve cryptographic verification rates between 97.5% and 98.3% on every condition at mean PESQ=3.02 and tens‑of‑milliseconds CPU latency. We explicitly compare APC against recent neural baselines (AudioSeal, WavMark, SilentCipher), detail the threat model (forgery resistance vs. erasure), characterize the dataset, define all metrics, quantify an adaptive white‑box erasure attack, and release code, keys, and metadata for reproducibility.
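
The QIM step on log‑magnitude differences of adjacent bin pairs, with a simple repetition code, can be sketched as follows. This is a minimal illustration only: the step size, the repetition factor, and the function names are assumptions, and the Ed25519 signing, Reed‑Solomon coding, STFT, and pseudo‑random bin selection of APC are omitted.

```python
import math


def qim_embed(value, bit, step):
    """Move value to the nearest lattice point of the form (2k + bit) * step."""
    return 2 * step * math.floor((value - bit * step) / (2 * step) + 0.5) + bit * step


def qim_extract(value, step):
    """Recover the bit from the parity of the nearest multiple of step."""
    return int(math.floor(value / step + 0.5)) % 2


def embed_bits(mag_pairs, bits, step=0.5, repeat=3):
    """Embed each bit into `repeat` consecutive magnitude pairs (m1, m2)."""
    out = []
    for i, (m1, m2) in enumerate(mag_pairs):
        if i >= len(bits) * repeat:
            out.append((m1, m2))  # unused pairs are left untouched
            continue
        d = math.log(m1) - math.log(m2)
        delta = qim_embed(d, bits[i // repeat], step) - d
        # Split the correction symmetrically across the two bins.
        out.append((m1 * math.exp(delta / 2), m2 * math.exp(-delta / 2)))
    return out


def extract_bits(mag_pairs, n_bits, step=0.5, repeat=3):
    """Blind extraction: majority vote over the repeated copies of each bit."""
    bits = []
    for k in range(n_bits):
        chunk = mag_pairs[k * repeat:(k + 1) * repeat]
        votes = [qim_extract(math.log(m1) - math.log(m2), step) for m1, m2 in chunk]
        bits.append(1 if 2 * sum(votes) > len(votes) else 0)
    return bits


# Hypothetical positive magnitudes for nine adjacent bin pairs.
pairs = [(1.0 + 0.1 * i, 2.0 - 0.05 * i) for i in range(9)]
payload = [1, 0, 1]
marked = embed_bits(pairs, payload)
# Mild distortion: a 5% gain change on the first bin of every pair.
distorted = [(m1 * 1.05, m2) for m1, m2 in marked]
print(extract_bits(distorted, len(payload)))
```

Extraction needs only the step size and the pair layout, not the original audio, which is what makes the scheme blind; a perturbation of the log‑magnitude difference smaller than half the step leaves each copy of the bit intact.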

Abstract:
This paper describes a submission to the Environment‑Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026, which addresses component‑level deepfake detection using the CompSpoofV2 dataset, where speech and environmental sounds may be independently manipulated. To address this challenge, a dual‑branch deepfake detection framework is proposed to jointly model speech and environmental contextual representations from input audio. Two pretrained models, XLS‑R for speech and BEATs for environmental sound, are used to extract complementary contextual representations. A Matching Head is introduced to model representation differences through statistical normalization and representation interaction, enabling estimation of the original class. In parallel, multi‑head cross‑attention enables effective information exchange between speech and environmental components. The refined representations are processed with residual connections and layer normalization, and passed to an AASIST classifier to predict speech‑based and environment‑based spoofing probabilities. The model outputs original, speech, and environment predictions. On the test set, the proposed system achieves an F1‑score of 70.20% and an environmental EER of 16.54%, outperforming the baseline system.

Abstract:
Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines; a focused study of SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS‑R (300M) that varies (i) similarity in SupCon (cosine vs angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm‑started global cross‑batch queue. Stage 1 fine‑tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44%), while angular similarity performs strongly without queued negatives (ITW 8.70%), indicating reduced reliance on large negative sets.

Abstract:
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real‑time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions introduced during RTC transmission, including unknown speech enhancement processes (e.g., noise suppression) and codec compression. To address this challenge, we present the first large‑scale speech deepfake dataset tailored for RTC scenarios, termed RTCFake, totaling approximately 600 hours. The dataset is constructed by transmitting speech through multiple mainstream social media and conferencing platforms (e.g., Zoom), enabling precise pairing between offline and online speech. In addition, we propose a phoneme‑guided consistency learning (PCL) strategy that encourages models to learn platform‑invariant semantic structural representations. The RTCFake dataset is divided into training, development, and evaluation sets. The evaluation set includes both unseen RTC platforms and unseen complex noise conditions, thereby providing a more realistic and challenging evaluation benchmark for speech deepfake detection. Furthermore, the proposed PCL strategy achieves significant improvements in both cross‑platform generalization and noise robustness, offering an effective and generalizable modeling paradigm. The RTCFake dataset is available at https://huggingface.co/datasets/JunXueTech/RTCFake.

Abstract:
In this paper, we propose a deep‑learning framework for environmental sound deepfake detection (ESDD) ‑‑ the task of identifying whether the sound scene and sound event in an input audio recording are fake. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre‑trained models, and ensembles of spectrograms or network architectures affect ESDD task performance. The experimental results on the benchmark datasets EnvSDD and ESDD‑Challenge‑TestSet indicate that detecting deepfake audio of sound scenes and detecting deepfake audio of sound events should be treated as separate tasks. We also show that fine‑tuning a pre‑trained model is more effective than training a model from scratch for the ESDD task. Finally, our best model, which was fine‑tuned from the pre‑trained WavLM model with the proposed three‑stage training strategy, achieves an accuracy of 0.98, an F1 score of 0.95, and an AUC of 0.99 on the EnvSDD Test subset, and an accuracy of 0.88, an F1 score of 0.77, and an AUC of 0.92 on the ESDD‑Challenge‑TestSet dataset.

Abstract:
Audio deepfakes pose a significant security threat, yet current state‑of‑the‑art (SOTA) detection systems do not generalize well to realistic in‑the‑wild deepfakes. We introduce a novel In‑Context Learning paradigm with comparison‑guidance for Audio Deepfake detection (ICLAD). The framework enables the use of audio language models (ALMs) for training‑free generalization to unseen deepfakes and provides textual rationales on the detection outcome. At the core of ICLAD is a pairwise comparative reasoning strategy that guides the ALM to discover and filter hallucinations and deepfake‑irrelevant acoustic attributes. The ALM works alongside a specialized deepfake detector, whereby a routing mechanism feeds out‑of‑distribution samples to the ALM. On in‑the‑wild datasets, ICLAD improves macro F1 over the specialized detector, with up to 2× relative improvement. Further analysis demonstrates the flexibility of ICLAD and its potential for deployment on recent open‑source ALMs.

Abstract:
Speech deepfake detection is a well‑established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open‑source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large‑scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross‑domain generalization, the choice of pre‑trained front‑end feature extractor dominates overall performance variance. Crucially, we show severe biases in high‑performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real‑world deployment with the necessary tools to address equitable training data selection and front‑end fine‑tuning.

Abstract:
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost‑effective, high‑fidelity generation and manipulation of both speech and non‑speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech‑centric, often relying on speech‑specific artifacts and exhibiting limited robustness to real‑world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All‑Type Audio Deepfake Detection (AT‑ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT‑ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real‑world scenarios and against unseen, state‑of‑the‑art speech generation methods; and (2) All‑Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type‑agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT‑ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.

Abstract:
Audio‑visual deepfake detection typically employs a complementary multi‑modal model to check for forgery traces in the video. These methods primarily extract forgery traces through audio‑visual alignment, exploiting the inconsistency between the audio and video modalities. However, traditional multi‑modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi‑scale cross‑modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi‑scale self‑attention to integrate the features of adjacent embeddings and a differential cross‑modal attention to fuse multi‑modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.

Abstract:
To gain fresh insights into the information processing characteristics of different audio classification models, we propose transferability analysis. Given a minimal, sufficient signal for a classification on a model f, transferability analysis asks whether other models accept this minimal signal as having the same classification as it did on f. We define what it means for a sufficient signal to be transferable and perform a large study over three classification tasks: music genre, emotion recognition, and deepfake detection. We find that transferability rates vary depending on the task, with sufficient signals for music genre being transferable approximately 26% of the time. The other tasks show much higher variance in transferability and reveal that some models, in particular on deepfake detection, have different transferability behavior. We call these models 'flat‑earther' models. We investigate deepfake audio in more depth, and show that transferability analysis also allows us to discover information‑theoretic differences between the models which are not captured by the more familiar metrics of accuracy and precision.

Abstract:
Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame‑level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame‑level transitions. Building on this, we propose TRACE (Training‑free Representation‑based Audio Countermeasure via Embedding dynamics), a training‑free framework that detects partial audio deepfakes by analyzing the first‑order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. On PartialSpoof, TRACE achieves 8.08% EER, competitive with fine‑tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark featuring LLM‑driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target‑domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training‑free audio forensics.
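
The idea of scoring first‑order embedding dynamics can be sketched on toy data. The abstract does not specify TRACE's exact statistic, so the cosine distance between consecutive frames and the max‑pooled utterance score below are assumptions, and the 2‑D "embeddings" stand in for frozen foundation‑model features.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two frame embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def transition_scores(frames):
    """First-order dynamics: distance between each pair of consecutive frames."""
    return [cosine_distance(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]


def utterance_score(frames):
    """Assumed pooling: the largest frame-to-frame jump in the utterance."""
    return max(transition_scores(frames))


def toy_trajectory(n_frames, splice_at=None, jump=1.5):
    """Slowly rotating 2-D embedding; an optional splice adds an abrupt rotation."""
    frames = []
    for t in range(n_frames):
        angle = 0.05 * t + (jump if splice_at is not None and t >= splice_at else 0.0)
        frames.append((math.cos(angle), math.sin(angle)))
    return frames


genuine = toy_trajectory(10)
spliced = toy_trajectory(10, splice_at=5)

scores = transition_scores(spliced)
boundary = scores.index(max(scores))  # index of the transition with the largest jump
print(utterance_score(genuine), utterance_score(spliced), boundary)
```

No parameters are learned in this sketch: the smooth trajectory yields uniformly small transition scores, while the spliced one shows a single large jump at the transition into the inserted segment.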

Abstract:
Multimodal deepfakes can exhibit subtle visual artifacts and cross‑modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self‑supervised audio‑visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on‑the‑fly, identity‑preserving, region‑aware self‑blended pseudo‑manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross‑modal evidence, SAVe also models lip‑speech synchronization via an audio‑visual alignment component that detects temporal misalignment patterns characteristic of audio‑visual forgeries. Experiments on FakeAVCeleb and AV‑LipSync‑TIMIT demonstrate competitive in‑domain performance and strong cross‑dataset generalization, highlighting self‑supervised learning as a scalable paradigm for multimodal deepfake detection.

Abstract:
Current audio deepfake detection has achieved remarkable performance using diverse deep learning architectures such as ResNet, and has seen further improvements with the introduction of large models (LMs) like Wav2Vec. The success of large language models (LLMs) further demonstrates the benefits of scaling model parameters, but also highlights one bottleneck where performance gains are constrained by parameter counts. Simply stacking additional layers, as done in current LLMs, is computationally expensive and requires full retraining. Furthermore, existing low‑rank adaptation methods are primarily applied to attention‑based architectures, which limits their scope. Inspired by the neuronal plasticity observed in mammalian brains, we propose novel algorithms, dropin and further plasticity, that dynamically adjust the number of neurons in certain layers to flexibly modulate model parameters. We evaluate these algorithms on multiple architectures, including ResNet, Gated Recurrent Neural Networks, and Wav2Vec. Experimental results using the widely recognised ASVSpoof2019 LA, PA, and FakeorReal datasets demonstrate consistent improvements in computational efficiency with the dropin approach, and maximum relative reductions in Equal Error Rate of around 39% and 66% with the dropin and plasticity approaches, respectively, across these datasets. The code and supplementary material are available at Github link.

Abstract:
Recent advancements in text‑to‑speech technologies enable generating high‑fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self‑supervised learning‑based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker‑specific correlations rather than artifact‑related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker‑nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker‑dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact‑related patterns, leading to state‑of‑the‑art performance.
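
The orthogonal projection step can be sketched in a few lines. How SNAP estimates the speaker subspace is not described in the abstract, so the speaker directions below are hypothetical inputs; the sketch only shows how components inside a given subspace are removed, leaving a residual feature.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-10):
    """Gram-Schmidt: turn spanning directions into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def null_speaker(feature, basis):
    """Residual r = x - sum_k <x, u_k> u_k, orthogonal to the speaker subspace."""
    residual = list(feature)
    for u in basis:
        c = dot(feature, u)
        residual = [ri - c * ui for ri, ui in zip(residual, u)]
    return residual


# Hypothetical speaker directions spanning a 2-D subspace of a 3-D feature space.
speaker_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(speaker_dirs)

feature = [3.0, -2.0, 0.5]
residual = null_speaker(feature, basis)
print(residual)
```

A downstream detector trained on such residuals cannot use any variation that lies along the removed directions, which is the intended effect of nulling.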

Abstract:
Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse‑to‑fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. In particular, different quantization levels capture complementary acoustic cues, where early quantizers encode coarse structure and later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer‑level hierarchy. We propose a hierarchy‑aware representation learning framework that models quantizer‑level contributions through learnable global weighting, enabling structured codec representations aligned with forensic cues. Keeping the speech encoder backbone frozen and updating only 4.4% additional parameters, our method achieves relative EER reductions of 46.2% on ASVspoof 2019 and 13.9% on ASVspoof5 over strong baselines.

Abstract:
Modern generative audio models can be used by an adversary unlawfully, specifically to impersonate other people and gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods have emerged. Unfortunately, current SDD methods generally generalize poorly to new audio domains and generators. Moreover, they lack interpretability, especially human‑like reasoning that would naturally explain the attribution of a given audio to the bona fide or spoof class and provide human‑perceptible cues. In this paper, we propose HIR‑SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with chain‑of‑thought reasoning derived from a newly proposed human‑annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to provide reasonable justifications for predictions.

Abstract:
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV‑VASM, a probabilistic framework for verifying the robustness of voice anti‑spoofing models (VASMs). PV‑VASM estimates the probability of misclassification under text‑to‑speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model‑agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.

Abstract:
Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors complicate reliable detection in real‑world environments, underscoring the need for representative evaluation benchmarks. To this end, we introduce ML‑ITW (Multilingual In‑The‑Wild), a multilingual dataset covering 14 languages, seven major platforms, and 180 public figures, totaling 28.39 hours of audio. We evaluate three detection paradigms: end‑to‑end neural models, self‑supervised feature‑based (SSL) methods, and audio large language models (Audio LLMs). Experimental results reveal significant performance degradation across diverse languages and real‑world acoustic conditions, highlighting the limited generalization ability of existing detectors in practical scenarios. The ML‑ITW dataset is publicly available.

Abstract:
Self‑supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security‑critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof‑SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram‑based architectures. We evaluate these models on multiple in‑domain and out‑of‑domain datasets. Our results reveal that large‑scale discriminative models such as XLS‑R, UniSpeech‑SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker‑aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.

Abstract:
Audio‑visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task‑specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV‑LMMDetect, a supervised fine‑tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification ‑ "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio‑visual encoder full fine‑tuning. On FakeAVCeleb and Mavos‑DD, AV‑LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos‑DD datasets.

Abstract:
As deepfake audio becomes more realistic and diverse, developing generalizable countermeasure systems has become crucial. Existing detection methods primarily depend on XLS‑R front‑end features to improve generalization. Nonetheless, their performance remains limited, partly due to insufficient attention to fine‑grained information, such as physiological cues or frequency‑domain features. In this paper, we propose BreathNet, a novel audio deepfake detection framework that integrates fine‑grained breath information to improve generalization. Specifically, we design BreathFiLM, a feature‑wise linear modulation mechanism that selectively amplifies temporal representations based on the presence of breathing sounds. BreathFiLM is trained jointly with the XLS‑R extractor, in turn encouraging the extractor to learn and encode breath‑related cues into the temporal features. Then, we use the frequency front‑end to extract spectral features, which are then fused with temporal features to provide complementary information introduced by vocoders or compression artifacts. Additionally, we propose a group of feature losses comprising Positive‑only Supervised Contrastive Loss (PSCL), center loss, and contrast loss. These losses jointly enhance the discriminative ability, encouraging the model to separate bona fide and deepfake samples more effectively in the feature space. Extensive experiments on five benchmark datasets demonstrate state‑of‑the‑art (SOTA) performance. Using the ASVspoof 2019 LA training set, our method attains 1.99% average EER across four related eval benchmarks, with particularly strong performance on the In‑the‑Wild dataset, where it achieves 4.70% EER. Moreover, under the ASVspoof5 evaluation protocol, our method achieves an EER of 4.94% on this latest benchmark.

Abstract:
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high‑order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph‑based framework that explicitly models these synergistic HOIs through clustering‑based hyperedges with class‑aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state‑of‑the‑art methods by 13.96% on 4 challenging cross‑domain datasets, demonstrating superior generalization to diverse attacks and speakers.

Abstract:
Transformer‑based models have shown strong performance in speech deepfake detection, largely due to the effectiveness of the multi‑head self‑attention (MHSA) mechanism. MHSA provides frame‑level attention scores, which are particularly valuable because deepfake artifacts often occur in small, localized regions along the temporal dimension of speech. This makes fine‑grained frame modeling essential for accurately detecting subtle spoofing cues. In this work, we propose fine‑grained frame modeling (FGFM) for MHSA‑based speech deepfake detection, where the most informative frames are first selected through a multi‑head voting (MHV) module. These selected frames are then refined via a cross‑layer refinement (CLR) module to enhance the model's ability to learn subtle spoofing cues. Experimental results demonstrate that our method outperforms the baseline model and achieves Equal Error Rate (EER) of 0.90%, 1.88%, and 6.64% on the LA21, DF21, and ITW datasets, respectively. These consistent improvements across multiple benchmarks highlight the effectiveness of our fine‑grained modeling for robust speech deepfake detection.
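
Several abstracts in this list report Equal Error Rate (EER), so a minimal sketch of how the metric can be computed from detector scores may be useful. It assumes that higher scores mean "more likely bona fide" and uses a simple threshold sweep over the observed scores; the function name `compute_eer` is hypothetical, and evaluation toolkits usually interpolate the ROC curve rather than picking the closest observed threshold.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) using a sweep over all observed scores.

    A trial is accepted as bona fide when score >= threshold.
    FAR: fraction of spoof trials accepted.
    FRR: fraction of bona fide trials rejected.
    The EER is approximated at the threshold where |FAR - FRR| is smallest.
    """
    best_gap, best_eer, best_thr = None, None, None
    for thr in sorted(set(bonafide_scores) | set(spoof_scores)):
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


# Toy scores: one bona fide trial and one spoof trial are on the wrong side.
eer, thr = compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.6, 0.3])
```

In this toy example one of four bona fide trials and one of four spoof trials are misclassified at the selected threshold, so both error rates are 25%.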

Abstract:
Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state‑of‑the‑art self‑supervised models provide rich multi‑layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin‑based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain‑invariant embeddings. Evaluated on ASVspoof 2021 DF and In‑the‑Wild datasets, our method achieves state‑of‑the‑art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross‑domain generation techniques and recording conditions.

Abstract:
The rapid advances in text‑to‑speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single‑speaker audio deepfakes, malicious real‑world applications in multi‑speaker conversational settings are also emerging as a major underexplored threat. To address this gap, we propose a conceptual taxonomy of multi‑speaker conversational audio deepfakes, distinguishing between partial manipulations (one or multiple speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi‑speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two‑speaker conversations, generated using VITS and SoundStorm‑based NotebookLM models to simulate natural dialogue with variations in speaker gender and conversational spontaneity. MsCADD is limited to text‑to‑speech (TTS) deepfakes. We benchmark three neural baseline models (LFCC‑LCNN, RawNet2, and Wav2Vec 2.0) on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR). Results show that these baseline models provide a useful benchmark, but they also reveal a significant gap in reliably detecting synthetic voices under varied conversational dynamics. Our dataset and benchmarks provide a foundation for future research on deepfake detection in conversational scenarios, a highly underexplored research area and a major threat to trustworthy information in audio settings. The MsCADD dataset is publicly available to support reproducibility and benchmarking by the research community.

Abstract:
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)‑based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine‑grained acoustic artifacts being overlooked during the decision‑making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic‑dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception‑enhanced Audio Large Language Model (SDD‑APALLM), an acoustically enhanced framework designed to explicitly expose fine‑grained time‑frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.

Abstract:
In this work, we introduce a multi‑task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker‑formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built‑in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.

Abstract:
Audio recorded in real‑world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text‑to‑speech, voice conversion, and other generation models, either component can now be modified independently. Such component‑level manipulations are harder to detect, as the remaining unaltered component can mislead systems designed for whole deepfake audio, and they often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation‑enhanced joint learning framework. CompSpoofV2 is a large‑scale curated dataset designed for component‑level audio anti‑spoofing, which contains over 250k audio samples, with a total duration of approximately 283 hours. Based on CompSpoofV2 and the separation‑enhanced joint learning framework, we launch the Environment‑Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing on component‑level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).

Abstract:
Recent advances in audio large language models (ALLMs) have made high‑quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real‑world audio deepfake detection (ADD) therefore requires all‑type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi‑task generalization ability of ALLMs, we first investigate their performance on all‑type ADD under both supervised fine‑tuning (SFT) and reinforcement fine‑tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black‑box classifier, sacrificing interpretability. Meanwhile, vanilla RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address this, we propose an automatic annotation and polishing pipeline that constructs Frequency‑Time structured chain‑of‑thought (CoT) rationales, producing ~340K cold‑start demonstrations. Building on CoT data, we propose Frequency Time‑Group Relative Policy Optimization (FT‑GRPO), a two‑stage training paradigm that cold‑starts ALLMs with SFT and then applies GRPO under rule‑based frequency‑time constraints. Experiments demonstrate that FT‑GRPO achieves state‑of‑the‑art performance on all‑type ADD while producing interpretable, FT‑grounded rationales. The data and code are available online.

Abstract:
As audio deepfakes transition from research artifacts to widely available commercial tools, robust biometric authentication faces pressing security threats in high‑stakes industries. This paper presents a systematic empirical evaluation of state‑of‑the‑art speaker authentication systems based on a large‑scale speech synthesis dataset, revealing two major security vulnerabilities: 1) modern voice cloning models trained on very small samples can easily bypass commercial speaker verification systems; and 2) anti‑spoofing detectors struggle to generalize across different methods of audio synthesis, leading to a significant gap between in‑domain performance and real‑world robustness. These findings call for a reconsideration of security measures and stress the need for architectural innovations, adaptive defenses, and the transition towards multi‑factor authentication.

Abstract:
The rapid growth of speech synthesis and voice conversion systems has made deepfake audio a major security concern. Bengali deepfake detection remains largely unexplored. In this work, we study automatic detection of Bengali audio deepfakes using the BanglaFake dataset. We evaluate zero‑shot inference with several pretrained models. These include Wav2Vec2‑XLSR‑53, Whisper, PANNsCNN14, WavLM and Audio Spectrogram Transformer. Zero‑shot results show limited detection ability. The best model, Wav2Vec2‑XLSR‑53, achieves 53.80% accuracy, 56.60% AUC and 46.20% EER. We then fine‑tune multiple architectures for Bengali deepfake detection. These include Wav2Vec2‑Base, LCNN, LCNN‑Attention, ResNet18, ViT‑B16 and CNN‑BiLSTM. Fine‑tuned models show strong performance gains. ResNet18 achieves the highest accuracy of 79.17%, F1 score of 79.12%, AUC of 84.37% and EER of 24.35%. Experimental results confirm that fine‑tuning significantly improves performance over zero‑shot inference. This study provides the first systematic benchmark of Bengali deepfake audio detection. It highlights the effectiveness of fine‑tuned deep learning models for this low‑resource language.

Abstract:
Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model‑centric and algorithm‑centric solutions, the impact of data composition is often underexplored. This paper proposes a data‑centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large‑scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity‑Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS‑Select (pruning) and DOSS‑Weight (re‑weighting). Our experiments show that DOSS‑Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k‑hour curated data pool using the optimal DOSS‑Weight strategy, achieves state‑of‑the‑art performance, outperforming large‑scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.

Abstract:
Deepfake audio detection has progressed rapidly with strong pre‑trained encoders (e.g., WavLM, Wav2Vec2, MMS). However, performance in realistic capture conditions, namely background noise (domestic/office/transport), room reverberation, and consumer channels, often lags clean‑lab results. We survey and evaluate robustness for state‑of‑the‑art audio deepfake detection models and present a reproducible framework that mixes MS‑SNSD noises with ASVspoof 2021 DF utterances to evaluate under controlled signal‑to‑noise ratios (SNRs). SNR is a measured proxy for noise severity used widely in speech; it lets us sweep from near‑clean (35 dB) to very noisy (‑5 dB) to quantify graceful degradation. We study multi‑condition training and fixed‑SNR testing for pretrained encoders (WavLM, Wav2Vec2, MMS), reporting accuracy, ROC‑AUC, and EER on binary and four‑class (authenticity × corruption) tasks. In our experiments, fine‑tuning reduces EER by 10‑15 percentage points at SNRs between 10 dB and 0 dB across backbones.
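
Mixing noise into an utterance at a controlled SNR, as described above, can be sketched as follows. This is a generic illustration rather than the framework's actual code: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, signals are plain Python lists of floats, and a synthetic sine wave and Gaussian noise stand in for ASVspoof 2021 DF utterances and MS‑SNSD recordings.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        # Repeat short noise clips until they cover the utterance.
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise clip is silent")
    # Target noise power is speech_power / 10^(snr_db / 10).
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR in dB recovered from the clean signal and the mixture."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]  # stand-in utterance
noise = [random.gauss(0.0, 1.0) for _ in range(4000)]                     # stand-in noise clip
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (35, 10, 0, -5)}
```

Sweeping `snr_db` over a grid such as 35 dB down to ‑5 dB yields one noisy copy of each utterance per condition, which is what fixed‑SNR testing requires.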

Abstract:
This paper introduces quantum circuit methodologies for pointwise multiplication and convolution of complex functions, conceptualized as "processing through encoding". Leveraging known techniques, we describe an approach where multiple complex functions are encoded onto auxiliary qubits. Applying the proposed scheme for two functions f and g, their pointwise product f(x)g(x) is shown to naturally form as the coefficients of part of the resulting quantum state. Adhering to the convolution theorem, we then demonstrate how the convolution f * g can be constructed. Similarly to related work, this involves the encoding of the Fourier coefficients \mathcal{F}[f] and \mathcal{F}[g], which facilitates their pointwise multiplication, followed by the inverse Quantum Fourier Transform. We discuss the simulation of these techniques, their integration into an extended quantumaudio package for audio signal processing, and present initial experimental validations. This work offers a promising avenue for quantum signal processing, with potential applications in areas such as quantum‑enhanced audio manipulation and synthesis.
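
The convolution theorem that this construction relies on can be checked classically. The sketch below is not a quantum circuit and does not use the quantumaudio package: it is a plain discrete Fourier transform over short complex sequences, with hypothetical function names, showing that multiplying Fourier coefficients pointwise and inverting the transform gives the circular convolution.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse includes the 1/n factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_convolution(f, g):
    """Direct definition: (f * g)[k] = sum_m f[m] * g[(k - m) mod n]."""
    n = len(f)
    return [sum(f[m] * g[(k - m) % n] for m in range(n)) for k in range(n)]


def convolution_via_fourier(f, g):
    """Transform both inputs, multiply pointwise, then apply the inverse transform."""
    product = [a * b for a, b in zip(dft(f), dft(g))]
    return dft(product, inverse=True)


f = [1, 2, 0, -1]
g = [0.5, 0, 1j, 0]
direct = circular_convolution(f, g)
via_fourier = convolution_via_fourier(f, g)
```

Both routes agree up to floating‑point error, which is the classical counterpart of encoding \mathcal{F}[f] and \mathcal{F}[g], multiplying them pointwise, and applying the inverse transform.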

Abstract:
Speech deepfake detection (DFD) has benefited from diverse acoustic and semantic speech representations, many of which encode valuable speech information and are costly to train. Existing approaches typically enhance DFD by tuning the representations or applying post‑hoc classification on frozen features, limiting control over improving discriminative DF cues without distorting original semantics. We find that emotion is encoded across diverse speech features and correlates with DFD. Therefore, we introduce a unified, feature‑agnostic, and non‑destructive training framework that uses emotion as a bridging constraint to guide speech features toward DFD, treating emotion recognition as a representation alignment objective rather than an auxiliary task, while preserving the original semantic information. Experiments on FakeOrReal and IntheWild show accuracy improvements of up to 6% and 2%, respectively, with corresponding reductions in equal error rate. Code is in the supplementary material.

Abstract:
Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3‑Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3‑Tracer consists of two complementary core modules: the Frame‑Audio Feature Aggregation Module (FA‑FAM) and the Segment‑level Multi‑Scale Discrepancy‑Aware Module (SMDAM). FA‑FAM is designed to detect the authenticity of each audio frame. It combines both frame‑level and audio‑level temporal information to detect intra‑frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual‑branch architecture that jointly models frame features and inter‑frame differences across multi‑scale temporal windows, effectively identifying abrupt anomalies that appear at forgery boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state‑of‑the‑art performance.

Abstract:
The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human‑centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training data, along with the absence of a standardized benchmark, hinders deeper exploration. To address this challenge, we first build Mega‑MMDF, a large‑scale, diverse, and high‑quality dataset for multimodal deepfake detection. Specifically, we employ 21 forgery pipelines through the combination of 10 audio forgery methods, 12 visual forgery methods, and 6 audio‑driven face reenactment methods. Mega‑MMDF currently contains 0.1 million real samples and 1.1 million forged samples, making it one of the largest and most diverse multimodal deepfake datasets, with plans for continuous expansion. Building on it, we present DeepfakeBench‑MM, the first unified benchmark for multimodal deepfake detection. It establishes standardized protocols across the entire detection pipeline and serves as a versatile platform for evaluating existing methods as well as exploring novel approaches. DeepfakeBench‑MM currently supports 5 datasets and 11 multimodal deepfake detectors. Furthermore, our comprehensive evaluations and in‑depth analyses uncover several key findings from multiple perspectives (e.g., augmentation, stacked forgery). We believe that DeepfakeBench‑MM, together with our large‑scale Mega‑MMDF, will serve as foundational infrastructure for advancing multimodal deepfake detection.

Abstract:
The rapid advancement of generative models has enabled the creation of increasingly stealthy synthetic voices, commonly referred to as audio deepfakes. A recent technique, FOICE [USENIX'24], demonstrates a particularly alarming capability: generating a victim's voice from a single facial image, without requiring any voice sample. By exploiting correlations between facial and vocal features, FOICE produces synthetic voices realistic enough to bypass industry‑standard authentication systems, including WeChat Voiceprint and Microsoft Azure. This raises serious security concerns, as facial images are far easier for adversaries to obtain than voice samples, dramatically lowering the barrier to large‑scale attacks. In this work, we investigate two core research questions: (RQ1) can state‑of‑the‑art audio deepfake detectors reliably detect FOICE‑generated speech under clean and noisy conditions, and (RQ2) does fine‑tuning these detectors on FOICE data improve detection without overfitting, thereby preserving robustness to unseen voice generators such as SpeechT5? Our study makes three contributions. First, we present the first systematic evaluation of FOICE detection, showing that leading detectors consistently fail under both standard and noisy conditions. Second, we introduce targeted fine‑tuning strategies that capture FOICE‑specific artifacts, yielding significant accuracy improvements. Third, we assess generalization after fine‑tuning, revealing trade‑offs between specialization to FOICE and robustness to unseen synthesis pipelines. These findings expose fundamental weaknesses in today's defenses and motivate new architectures and training protocols for next‑generation audio deepfake detection.

Abstract:
The growing prevalence of speech deepfakes has raised serious concerns, particularly in real‑world scenarios such as telephone fraud and identity theft. While many anti‑spoofing systems have demonstrated promising performance on lab‑generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low‑cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting‑edge zero‑shot text‑to‑speech (TTS) speech and physical replay recordings collected under varied devices and real‑world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real‑world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.

Abstract:
In this paper, we introduce the concept of forensic similarity in the speech deepfake detection domain, which aims to determine whether two audio segments share the same underlying forensic traces. Our approach is inspired by prior work in the image domain. To transfer this idea to the audio domain, we propose a two‑stage deep learning framework consisting of a Siamese‑based feature extractor and a core decision module, referred to as the similarity network. The system aims to assess whether two speech samples originate from the same source by comparing their forensic characteristics. In practice, the model maps pairs of audio segments to a similarity score indicating whether they contain identical or different forensic traces. We evaluate the proposed method on the emerging task of source verification, demonstrating its ability to determine whether two speech samples were generated by the same model. In addition, we explore its applicability to audio splicing detection as a complementary use case. Experimental results show that the proposed approach generalizes well to previously unseen forensic traces, highlighting its robustness, flexibility, and practical relevance for digital audio forensics.

Abstract:
While the technologies empowering malicious audio deepfakes have dramatically evolved in recent years due to generative AI advances, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and research methodologies have led to systems that fail to generalize to real‑world applications. The main reason is the difference between raw deepfake audio and deepfake audio that has been presented through a communication channel, e.g. by phone. We propose a new framework for data creation and research methodology, allowing for the development of spoofing countermeasures that would be more effective in real‑world scenarios. By following the guidelines outlined here, we improved deepfake detection accuracy by 39% in more robust and realistic lab setups, and by 57% on a real‑world benchmark. We also demonstrate that improvements in datasets would have a bigger impact on deepfake detection accuracy than choosing larger SOTA models over smaller ones; that is, it would be more important for the scientific community to invest in comprehensive data collection programs than to simply train larger models with higher computational demands.

Abstract:
Neural speech synthesis techniques have enabled highly realistic speech deepfakes, posing major security risks. Speech deepfake detection is challenging due to distribution shifts across spoofing methods and variability in speakers, channels, and recording conditions. We explore learning shared discriminative features as a path to robust detection and propose Information Bottleneck enhanced Confidence‑Aware Adversarial Network (IB‑CAAN). Confidence‑guided adversarial alignment adaptively suppresses attack‑specific artifacts without erasing discriminative cues, while the information bottleneck removes nuisance variability to preserve transferable features. Experiments on ASVspoof 2019/2021, ASVspoof 5, and In‑the‑Wild demonstrate that IB‑CAAN consistently outperforms the baseline and achieves state‑of‑the‑art performance on many benchmarks.

Abstract:
With the prevalence of artificial intelligence (AI)‑generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real‑world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open‑source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in‑ and out‑of‑domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.

Abstract:
In speech deepfake detection (SDD), data augmentation (DA) is commonly used to improve model generalization across varied speech conditions and spoofing attacks. However, during training, the backpropagated gradients from original and augmented inputs may misalign, which can result in conflicting parameter updates. These conflicts could hinder convergence and push the model toward suboptimal solutions, thereby reducing the benefits of DA. To investigate and address this issue, we design a dual‑path data‑augmented (DPDA) training framework with gradient alignment for SDD. In our framework, each training utterance is processed through two input paths: one using the original speech and the other with its augmented version. This design allows us to compare and align their backpropagated gradient directions to reduce optimization conflicts. Our analysis shows that approximately 25% of training iterations exhibit gradient conflicts between the original inputs and their augmented counterparts when using RawBoost augmentation. By resolving these conflicts with gradient alignment, our method accelerates convergence by reducing the number of training epochs and achieves up to an 18.69% relative reduction in Equal Error Rate on the In‑the‑Wild dataset compared to the baseline.
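
The notion of a gradient conflict between the original and augmented paths can be illustrated with a short sketch. The abstract does not specify the alignment rule, so the projection used here (removing the conflicting component, in the style of PCGrad) is an assumption, as are the function names; gradients are represented as flat lists of floats.

```python
def dot(u, v):
    """Inner product of two equally sized gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def align_gradients(grad_original, grad_augmented):
    """Return (aligned_augmented_gradient, conflict_detected).

    A conflict is flagged when the two gradients have a negative inner
    product. In that case the augmented gradient is projected onto the
    plane orthogonal to the original gradient (an assumed, PCGrad-style
    rule), so the combined update no longer opposes the original path.
    """
    inner = dot(grad_original, grad_augmented)
    if inner >= 0:
        return list(grad_augmented), False
    scale = inner / dot(grad_original, grad_original)
    projected = [a - scale * o for a, o in zip(grad_augmented, grad_original)]
    return projected, True


# Toy 2-D gradients: the augmented path pulls against the original one.
g_orig = [1.0, 0.0]
g_aug = [-1.0, 1.0]
g_aligned, conflict = align_gradients(g_orig, g_aug)
combined = [o + a for o, a in zip(g_orig, g_aligned)]
```

Counting how often `conflict` is true over training iterations gives the kind of conflict rate the abstract reports for RawBoost augmentation.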

Abstract:
Recent work shows that one‑class learning can detect unseen deepfake attacks by modeling a compact distribution of bona fide speech around a single centroid. However, the single‑centroid assumption can oversimplify the bona fide speech representation and overlook useful cues, such as speech quality, which reflects the naturalness of the speech. Speech quality can be easily obtained using existing speech quality assessment models that estimate it as a Mean Opinion Score. In this paper, we propose QAMO: Quality‑Aware Multi‑Centroid One‑Class Learning for speech deepfake detection. QAMO extends conventional one‑class learning by introducing multiple quality‑aware centroids. In QAMO, each centroid is optimized to represent a distinct speech quality subspace, enabling better modeling of intra‑class variability in bona fide speech. In addition, QAMO supports a multi‑centroid ensemble scoring strategy, which improves decision thresholding and reduces the need for quality labels during inference. With two centroids to represent high‑ and low‑quality speech, our proposed QAMO achieves an equal error rate of 5.09% on the In‑the‑Wild dataset, outperforming previous one‑class and quality‑aware systems.
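
One way to score an utterance against several bona fide centroids without knowing its quality label is sketched below. The abstract does not state the ensemble rule, so taking the maximum cosine similarity over centroids is an assumption, and the function names and toy two‑dimensional centroids are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return inner / (norm_u * norm_v)


def multi_centroid_score(embedding, centroids):
    """Score an embedding by its closest centroid (assumed ensemble rule).

    Higher scores mean the utterance looks more like bona fide speech of
    at least one quality level; no quality label is required at test time.
    """
    return max(cosine_similarity(embedding, c) for c in centroids)


# Two hypothetical centroids, e.g. for high- and low-quality bona fide speech.
centroids = [[1.0, 0.0], [0.0, 1.0]]
bonafide_like = multi_centroid_score([0.1, 2.0], centroids)
spoof_like = multi_centroid_score([-1.0, -1.0], centroids)
```

An embedding close to either centroid receives a high score, while one far from both receives a low score and would be flagged as a deepfake after thresholding.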

Abstract:
Speech deepfake detectors are often evaluated on clean, benchmark‑style conditions, but deployment occurs in an open world of shifting devices, sampling rates, codecs, environments, and attack families. This creates a "coverage debt" for AI‑based detectors: every new condition multiplies with existing ones, producing data blind spots that grow faster than data can be collected. Because attackers can target these uncovered regions, worst‑case performance (not average benchmark scores) determines security. To demonstrate the impact of the coverage debt problem, we analyze results from a recent cross‑testing framework. Grouping performance by bona fide domain and spoof release year, two patterns emerge: newer synthesizers erase the legacy artifacts detectors rely on, and conversational speech domains (teleconferencing, interviews, social media) are consistently the hardest to secure. These findings show that detection alone should not be relied upon for high‑stakes decisions. Detectors should be treated as auxiliary signals within layered defenses that include provenance, personhood credentials, and policy safeguards.

Abstract:
The rapid growth of the digital economy in South‑East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle this critical region. This omission matters: detection models trained on high‑resource languages collapse when applied to SEA, due to mismatches in synthesis quality, language‑specific characteristics, and data scarcity. To close this gap, we present SEA‑Spoof, the first large‑scale Audio Deepfake Detection (ADD) dataset built specifically for SEA languages. SEA‑Spoof spans 300+ hours of paired real and spoof speech across Tamil, Hindi, Thai, Indonesian, Malay, and Vietnamese. Spoof samples are generated from a diverse mix of state‑of‑the‑art open‑source and commercial systems, capturing wide variability in style and fidelity. Benchmarking state‑of‑the‑art detection models reveals severe cross‑lingual degradation, but fine‑tuning on SEA‑Spoof dramatically restores performance across languages and synthesis sources. These results highlight the urgent need for SEA‑focused research and establish SEA‑Spoof as a foundation for developing robust, cross‑lingual, and fraud‑resilient detection systems.

Abstract:
In this paper, we present the XMUspeech systems submitted to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in the ASVspoof 5 database has significantly increased, and we observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the performance of AASIST, HM‑Conformer, Hubert, and Wav2vec2 with various input features and loss functions. Specifically, to obtain artifact‑related information, we trained self‑supervised models on a dataset containing spoofing utterances and used them as feature extractors. We also applied an adaptive multi‑scale feature fusion (AMFF) method to integrate features from multiple Transformer layers with the hand‑crafted feature to enhance the detection capability. In addition, we conducted extensive experiments on one‑class loss functions and provided optimized configurations to better align with the anti‑spoofing task. Our fusion system achieved a minDCF of 0.4783 and an EER of 20.45% in the closed condition, and a minDCF of 0.2245 and an EER of 9.36% in the open condition.

Abstract:
AI‑generated speech is increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfakes become nearly indistinguishable from real human speech, the need for robust detection methods and effective countermeasures has become critically urgent. In this paper, we present the ISPL's submission to the SAFE challenge at IH&MMSec 2025, where our system ranked first across all tasks. Our solution introduces a novel approach to audio deepfake detection based on a Mixture of Experts architecture. The proposed system leverages multiple state‑of‑the‑art detectors, combining their outputs through an attention‑based gating network that dynamically weights each expert based on the input speech signal. In this design, each expert develops a specialized understanding of the shared training data by learning to capture different complementary aspects of the same input through inductive biases. Experimental results indicate that our method outperforms existing approaches across multiple datasets. We further evaluate and analyze the performance of our system in the SAFE challenge.
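
The gating idea described above, where expert outputs are combined with input-dependent weights, can be sketched as follows. This is a simplified illustration and not the authors' architecture: the gate logits are passed in directly, whereas in the described system they would be produced by an attention-based network from the input speech. The names `softmax` and `moe_score` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_score(expert_scores, gate_logits):
    """Weighted sum of per-expert scores using softmax gate weights.

    gate_logits stands in for the output of a gating network (assumption).
    Returns the combined score and the weights that produced it.
    """
    weights = softmax(gate_logits)
    combined = sum(w * s for w, s in zip(weights, expert_scores))
    return combined, weights
```

Equal gate logits reduce this to a plain average of the experts, while a dominant logit makes the output follow a single expert.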

Abstract:
The widespread use of generative AI has shown remarkable success in producing highly realistic deepfakes, posing a serious threat to various voice biometric applications, including speaker verification, voice biometrics, audio conferencing, and criminal investigations. To counteract this, several state‑of‑the‑art (SoTA) audio deepfake detection (ADD) methods have been proposed to identify generative AI signatures to distinguish between real and deepfake audio. However, the effectiveness of these methods is severely undermined by anti‑forensic (AF) attacks that conceal generative signatures. These AF attacks span a wide range of techniques, including statistical modifications (e.g., pitch shifting, filtering, noise addition, and quantization) and optimization‑based attacks (e.g., FGSM, PGD, C&W, and DeepFool). In this paper, we investigate the SoTA ADD methods and provide a comparative analysis to highlight their effectiveness in exposing deepfake signatures, as well as their vulnerabilities under adversarial conditions. We conducted an extensive evaluation of ADD methods on five deepfake benchmark datasets using two categories: raw and spectrogram‑based approaches. This comparative analysis enables a deeper understanding of the strengths and limitations of SoTA ADD methods against diverse AF attacks. It not only highlights vulnerabilities of ADD methods, but also informs the design of more robust and generalized detectors for real‑world voice biometrics. It will further guide future research in developing adaptive defense strategies that can effectively counter evolving AF techniques.

Abstract:
Speech synthesis systems can now produce highly realistic vocalisations that pose significant authenticity challenges. Despite substantial progress in deepfake detection models, their real‑world effectiveness is often undermined by evolving distribution shifts between training and test data, driven by the complexity of human speech and the rapid evolution of synthesis systems. Existing datasets suffer from limited real speech diversity, insufficient coverage of recent synthesis systems, and heterogeneous mixtures of deepfake sources, which hinder systematic evaluation and open‑world model training. To address these issues, we introduce AUDETER (AUdio DEepfake TEst Range), a large‑scale and highly diverse deepfake audio dataset comprising over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders, totalling 3 million clips. We further observe that most existing detectors default to binary supervised training, which can induce negative transfer across synthesis sources when the training data contains highly diverse deepfake patterns, impacting overall generalisation. As a complementary contribution, we propose an effective curriculum‑learning‑based approach to mitigate this effect. Extensive experiments show that existing detection models struggle to generalise to novel deepfakes and human speech in AUDETER, whereas XLR‑based detectors trained on AUDETER achieve strong cross‑domain performance across multiple benchmarks, achieving an EER of 1.87% on In‑the‑Wild. AUDETER is available on GitHub.

Abstract:
Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural‑sounding synthetic speech that closely mimics human voices. While these innovations hold promise for applications like assistive technologies, they also pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited by generalization to unseen deepfake attacks and languages. To address this, we propose a gating mechanism that extracts relevant features from the XLS‑R speech foundation model, which serves as the front‑end feature extractor. For the downstream back‑end classifier, we employ Multi‑kernel gated Convolution (MultiConv) to capture both local and global speech artifacts. Additionally, we introduce Centered Kernel Alignment (CKA) as a similarity metric to enforce diversity in learned features across different MultiConv layers. By integrating CKA with our gating mechanism, we hypothesize that each component helps improve the learning of distinct synthetic speech patterns. Experimental results demonstrate that our approach achieves state‑of‑the‑art performance on in‑domain benchmarks while generalizing robustly to out‑of‑domain datasets, including multilingual speech samples. This underscores its potential as a versatile solution for detecting evolving speech deepfake threats.
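
Centered Kernel Alignment, used above as a similarity metric between layers, has a simple linear form: for column-centered feature matrices X and Y (rows are examples), CKA = ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F). The sketch below implements this linear variant in plain Python. Whether the paper uses the linear or a kernel variant is not stated in the abstract, so that choice and all function names are assumptions.

```python
import math


def center_columns(m):
    """Subtract the per-column mean from a row-major matrix."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[v - means[j] for j, v in enumerate(row)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows.

    Returns a value in [0, 1]; features with zero variance would make the
    denominator zero and are not handled in this sketch.
    """
    x, y = center_columns(x), center_columns(y)
    num = cross_frobenius_sq(y, x)
    den = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return num / den
```

A value near 1 means two layers encode nearly the same information, so a diversity term would penalize high CKA between layers; how that term enters the training loss is not specified in the abstract.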

Abstract:
Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, along with standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank the systems to help researchers and developers enhance their reliability and robustness. We include 14 evaluation sets, 12 state‑of‑the‑art open‑source and 3 proprietary detection systems. Our study shows that many systems exhibit high EER in out‑of‑domain scenarios, highlighting the need for extensive cross‑domain evaluation. The leaderboard is hosted on Hugging Face and a toolkit for reproducing results across the listed datasets is available on GitHub.

Abstract:
Deepfake detection is a critical task in identifying manipulated multimedia content. In real‑world scenarios, deepfake content can manifest across multiple modalities, including audio and video. To address this challenge, we present ERF‑BA‑TFD+, a novel multimodal deepfake detection model that combines enhanced receptive field (ERF) and audio‑visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF‑BA‑TFD+ lies in its ability to model long‑range dependencies within the audio‑visual input, allowing it to better capture subtle discrepancies between real and fake content. In our experiments, we evaluate ERF‑BA‑TFD+ on the DDL‑AV dataset, which consists of both segmented and full‑length video clips. Unlike previous benchmarks, which focused primarily on isolated segments, the DDL‑AV dataset allows us to assess the model's performance in a more comprehensive and realistic setting. Our method achieves state‑of‑the‑art results on this dataset, outperforming existing techniques in terms of both accuracy and processing speed. The ERF‑BA‑TFD+ model demonstrated its effectiveness in the "Workshop on Deepfake Detection, Localization, and Interpretability," Track 2: Audio‑Visual Detection and Localization (DDL‑AV), and won first place in this competition.

Abstract:
Current audio deepfake detectors cannot be trusted. While they excel on controlled benchmarks, they fail when tested in the real world. We introduce Perturbed Public Voices (P^2V), an IRB‑approved dataset capturing three critical aspects of malicious deepfakes: (1) identity‑consistent transcripts via LLMs, (2) environmental and adversarial noise, and (3) state‑of‑the‑art voice cloning (2020‑2025). Experiments reveal alarming vulnerabilities of 22 recent audio deepfake detectors: models trained on current datasets lose 43% performance when tested on P^2V, with performance measured as the mean of F1 score on deepfake audio, AUC, and 1‑EER. Simple adversarial perturbations induce up to 16% performance degradation, while advanced cloning techniques reduce detectability by 20‑30%. In contrast, P^2V‑trained models maintain robustness against these attacks while generalizing to existing datasets, establishing a new benchmark for robust audio deepfake detection. P^2V will be publicly released upon acceptance by a conference/journal.
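
The composite performance measure described above (the mean of F1 score on deepfake audio, AUC, and 1‑EER) can be written out as a short calculation. The helper names below are hypothetical, and `relative_drop` reflects one possible reading of "lose 43% performance"; the abstract does not say whether the loss is relative or absolute.

```python
def composite_score(f1_deepfake, auc, eer):
    """Mean of F1 on the deepfake class, AUC, and (1 - EER), all in [0, 1]."""
    return (f1_deepfake + auc + (1.0 - eer)) / 3.0


def relative_drop(before, after):
    """Relative performance loss between two evaluations (assumed reading)."""
    return (before - after) / before
```

For example, a detector with F1 = 0.90, AUC = 0.95, and EER = 0.05 has a composite score of (0.90 + 0.95 + 0.95) / 3, or about 0.933.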

Abstract:
The rapid advancement of speech generation technology has led to the widespread proliferation of deepfake speech across social media platforms. While deepfake audio countermeasures (CMs) achieve promising results on public datasets, their performance degrades significantly in cross‑domain scenarios. To advance CMs for real‑world deepfake detection, we first propose the Fake Speech Wild (FSW) dataset, which includes 254 hours of real and deepfake audio from four different media platforms, focusing on social media. We then establish a benchmark using public datasets and advanced self‑supervised learning (SSL)‑based CMs to evaluate current CMs in real‑world scenarios. We also assess the effectiveness of data augmentation strategies in enhancing CM robustness for detecting deepfake speech on social media. Finally, by augmenting public datasets and incorporating the FSW training set, we significantly advance real‑world deepfake audio detection performance, achieving an average equal error rate (EER) of 3.54% across all evaluation sets.

Abstract:
The rapid development of audio‑driven talking head generators and advanced Text‑To‑Speech (TTS) models has led to more sophisticated temporal deepfakes. These advances highlight the need for robust methods capable of detecting and localizing deepfakes, even under novel, unseen attack scenarios. Current state‑of‑the‑art deepfake detectors, while accurate, are often computationally expensive and struggle to generalize to novel manipulation techniques. To address these challenges, we propose multimodal approaches for the AV‑Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self‑supervised learning (SSL) backbone coupled with graph attention networks to capture rich audio representations, improving detection robustness. Our approach strikes a balance between performance and real‑world deployment, focusing on resilience and potential interpretability. On the AV‑Deepfake1M++ dataset, our multimodal system achieves an AUC of 92.78% for the deepfake classification task and an IoU of 0.3536 for temporal localization using only the audio modality.
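
The IoU figure reported for temporal localization measures the overlap between predicted and ground-truth manipulated segments. Below is a minimal sketch for a single pair of segments; the challenge's official metric may aggregate over several segments and clips in a different way, and the function name `temporal_iou` is an assumption.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two time segments given as (start, end) in seconds."""
    # Overlap is the shared duration, clipped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction of (1.0, 3.0) against a ground truth of (2.0, 4.0) shares one second out of a three-second union, giving an IoU of 1/3.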

Abstract:
The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention‑enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open‑set conditions, we incorporate confidence‑based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1‑scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA's robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open‑set conditions, validated on public benchmarks and accompanied by publicly released models and code. Models and code are available at https://www.github.com/adipiz99/lava‑framework.
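
Confidence-based rejection for open-set attribution can be illustrated with a generic sketch: the classifier's top softmax probability is compared to a threshold, and low-confidence inputs are labeled as unknown. This is not taken from the LAVA code. The function name, the use of the maximum softmax probability as the confidence, and the default threshold of 0.9 are all assumptions.

```python
import math


def classify_with_rejection(logits, labels, threshold=0.9):
    """Return (label, confidence), or ("unknown", confidence) below the threshold."""
    # Numerically stable softmax over the class logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        # Not confident enough to attribute the sample to any known class.
        return "unknown", probs[best]
    return labels[best], probs[best]
```

In practice the threshold would be tuned on held-out data that includes generators absent from training.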

Abstract:
Advances in Generative AI have made video‑level deepfake detection increasingly challenging, exposing the limitations of current detection techniques. In this paper, we present HOLA, our solution to the Video‑Level Deepfake Detection track of the 2025 1M‑Deepfakes Detection Challenge. Inspired by the success of large‑scale pre‑training in the general domain, we first scale audio‑visual self‑supervised pre‑training for multimodal video‑level deepfake detection, which leverages our self‑built dataset of 1.81M samples, thereby leading to a unified two‑stage framework. To be specific, HOLA features an iterative‑aware cross‑modal learning module for selective audio‑visual interactions, hierarchical contextual modeling with gated aggregations under the local‑global perspective, and a pyramid‑like refiner for scale‑aware cross‑grained semantic enhancements. Moreover, we propose the pseudo supervised signal injection strategy to further boost model performance. Extensive experiments across expert models and MLLMs impressively demonstrate the effectiveness of our proposed HOLA. We also conduct a series of ablation studies to explore the crucial design factors of our introduced components. Remarkably, our HOLA ranks 1st, outperforming the second‑place entry by 0.0476 AUC on the TestA set.

Abstract:
Recent advances in synthetic speech have made audio deepfakes increasingly realistic, posing significant security risks. Existing detection methods that rely on a single modality, either raw waveform embeddings or spectral‑based features, are vulnerable to non‑spoof disturbances and often overfit to known forgery algorithms, resulting in poor generalization to unseen attacks. To address these shortcomings, we investigate hybrid fusion frameworks that integrate self‑supervised learning (SSL) based representations with handcrafted spectral descriptors (MFCC, LFCC, CQCC). By aligning and combining complementary information across modalities, these fusion approaches capture subtle artifacts that single‑feature approaches typically overlook. We explore several fusion strategies, including simple concatenation, cross‑attention, mutual cross‑attention, and a learnable gating mechanism, to optimally blend SSL features with fine‑grained spectral cues. We evaluate our approach on four challenging public benchmarks and report generalization performance. All fusion variants consistently outperform an SSL‑only baseline, with the cross‑attention strategy achieving the best generalization with a 38% relative reduction in equal error rate (EER). These results confirm that joint modeling of waveform and spectral views produces robust, domain‑agnostic representations for audio deepfake detection.
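
Of the fusion strategies listed above, the learnable gating mechanism is the simplest to illustrate. The sketch below blends an SSL feature vector with a spectral feature vector of the same length using a per-dimension sigmoid gate computed from their concatenation. It shows the general technique only and not the paper's exact formulation; `gate_weights` and `gate_bias` are hypothetical parameters that would be learned during training.

```python
import math


def gated_fusion(ssl_feat, spec_feat, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a per-dimension sigmoid gate.

    gate_weights[i] holds the weights applied to the concatenation
    [ssl_feat; spec_feat] for output dimension i (hypothetical parameters).
    """
    joint = list(ssl_feat) + list(spec_feat)
    fused = []
    for i, (a, b) in enumerate(zip(ssl_feat, spec_feat)):
        z = sum(w * v for w, v in zip(gate_weights[i], joint)) + gate_bias[i]
        g = 1.0 / (1.0 + math.exp(-z))  # gate value in (0, 1)
        # g near 1 favors the SSL feature, g near 0 favors the spectral one.
        fused.append(g * a + (1.0 - g) * b)
    return fused
```

With all gate parameters at zero the gate is 0.5 everywhere and the output is the element-wise average of the two views.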

Abstract:
Audio plays a crucial role in applications like speaker verification, voice‑enabled smart devices, and audio conferencing. However, audio manipulations, such as deepfakes, pose significant risks by enabling the spread of misinformation. Our empirical analysis reveals that existing methods for detecting deepfake audio are often vulnerable to anti‑forensic (AF) attacks, particularly those mounted using generative adversarial networks. In this article, we propose a novel collaborative learning method called SHIELD to defend against generative AF attacks. To expose AF signatures, we integrate an auxiliary generative model, called the defense (DF) generative model, which facilitates collaborative learning by combining input and output. Furthermore, we design a triplet model to capture correlations of real and AF‑attacked audio with real‑generated and attacked‑generated audio using auxiliary generative models. The proposed SHIELD strengthens the defense against generative AF attacks and achieves robust performance across various generative models. The proposed AF attack significantly reduces the average detection accuracy from 95.49% to 59.77% for ASVspoof2019, from 99.44% to 38.45% for In‑the‑Wild, and from 98.41% to 51.18% for HalfTruth for three different generative models. The proposed SHIELD mechanism is robust against AF attacks and achieves an average accuracy of 98.13%, 98.58%, and 99.57% in match, and 98.78%, 98.62%, and 98.85% in mismatch settings for the ASVspoof2019, In‑the‑Wild, and HalfTruth datasets, respectively.

Abstract:
Recent advances in generative AI have made the creation of speech deepfakes widely accessible, posing serious challenges to digital trust. To counter this, various speech deepfake detection strategies have been proposed, including Person‑of‑Interest (POI) approaches, which focus on identifying impersonations of specific individuals by modeling and analyzing their unique vocal traits. Despite their excellent performance, the existing methods offer limited granularity and lack interpretability. In this work, we propose a POI‑based speech deepfake detection method that operates at the phoneme level. Our approach decomposes reference audio into phonemes to construct a detailed speaker profile. At inference time, phonemes from a test sample are individually compared against this profile, enabling fine‑grained detection of synthetic artifacts. The proposed method achieves comparable accuracy to traditional approaches while offering superior robustness and interpretability, key aspects in multimedia forensics. By focusing on phoneme analysis, this work explores a novel direction for explainable, speaker‑centric deepfake detection.

Abstract:
In this paper, we present our comprehensive study aimed at enhancing the generalization capabilities of audio deepfake detection models. We investigate the performance of various pre‑trained backbones, including Wav2Vec2, WavLM, and Whisper, across a diverse set of datasets, including those from the ASVspoof challenges and additional sources. Our experiments focus on the effects of different data augmentation strategies and loss functions on model performance. The results of our research demonstrate substantial enhancements in the generalization capabilities of audio deepfake detection models, surpassing the performance of the top‑ranked single system in the ASVspoof 5 Challenge. This study contributes valuable insights into the optimization of audio models for more robust deepfake detection and facilitates future research in this critical area.

Abstract:
Advancements in audio deepfake technology offer benefits like AI assistants, better accessibility for speech impairments, and enhanced entertainment. However, they also pose significant risks to security, privacy, and trust in digital communications. Detecting and mitigating these threats requires comprehensive datasets. Existing datasets lack diverse ethnic accents, making them inadequate for many real‑world scenarios. Consequently, models trained on these datasets struggle to detect audio deepfakes in diverse linguistic and cultural contexts such as in South‑Asian countries. Ironically, there is a stark lack of South‑Asian speaker samples in the existing datasets, despite South Asians constituting a quarter of the world's population. This work introduces the IndieFake Dataset (IFD), featuring 27.17 hours of bonafide and deepfake audio from 50 English‑speaking Indian speakers. IFD offers balanced data distribution and includes speaker‑level characterization, absent in datasets like ASVspoof21 (DF). We evaluated various baselines on IFD against existing ASVspoof21 (DF) and In‑The‑Wild (ITW) datasets. IFD outperforms ASVspoof21 (DF) and proves to be more challenging compared to the benchmark ITW dataset. The complete dataset, along with documentation and sample reference clips, is publicly accessible for research use on the project website.

Abstract:
With the development of audio deepfake techniques, attacks using partially deepfake audio are on the rise. Compared to fully deepfake audio, partially deepfake audio is much harder for detectors to identify because only part of the signal is manipulated and the manipulation is well hidden, resulting in higher security risks. Although some studies have been launched, there is no comprehensive review that systematically introduces the current state and development trends of this problem. Thus, in this survey, we are the first to provide a systematic introduction to manipulated region localization in partially deepfake audio, covering the fundamentals, branches of existing methods, current limitations, and potential trends, and offering insight into this area.

Abstract:
A new class of audio deepfakes, codecfakes (CFs), has recently caught attention, synthesized by Audio Language Models that leverage neural audio codecs (NACs) in the backend. In response, the community has introduced dedicated benchmarks and tailored detection strategies. As the field advances, efforts have moved beyond binary detection toward source attribution, including open‑set attribution, which aims to identify the NAC responsible for generation and flag novel, unseen ones during inference. This shift toward source attribution improves forensic interpretability and accountability. However, open‑set attribution remains fundamentally limited: while it can detect that a NAC is unfamiliar, it cannot characterize or identify individual unseen codecs. It treats such inputs as generic "unknowns", lacking insight into their internal configuration. This leads to major shortcomings: limited generalization to new NACs and inability to resolve fine‑grained variations within NAC families. To address these gaps, we propose Neural Audio Codec Source Parsing (NACSP), a paradigm shift that reframes source attribution for CFs as structured regression over generative NAC parameters such as quantizers, bandwidth, and sampling rate. We formulate NACSP as a multi‑task regression task for predicting these NAC parameters and establish the first comprehensive benchmark using various state‑of‑the‑art speech pre‑trained models (PTMs). To this end, we propose HYDRA, a novel framework that leverages hyperbolic geometry to disentangle complex latent properties from PTM representations. By employing task‑specific attention over multiple curvature‑aware hyperbolic subspaces, HYDRA enables superior multi‑task generalization. Our extensive experiments show HYDRA achieves top results on benchmark CFs datasets compared to baselines operating in Euclidean space.

Abstract:
Generalization remains a critical challenge in speech deepfake detection (SDD). While various approaches aim to improve robustness, generalization is typically assessed through performance metrics like equal error rate without a theoretical framework to explain model performance. This work investigates sharpness as a theoretical proxy for generalization in SDD. We analyze how sharpness responds to domain shifts and find it increases in unseen conditions, indicating higher model sensitivity. Based on this, we apply Sharpness‑Aware Minimization (SAM) to reduce sharpness explicitly, leading to better and more stable performance across diverse unseen test sets. Furthermore, correlation analysis confirms a statistically significant relationship between sharpness and generalization in most test settings. These findings suggest that sharpness can serve as a theoretical indicator for generalization in SDD and that sharpness‑aware training offers a promising strategy for improving robustness.
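
Sharpness‑Aware Minimization, applied above to reduce sharpness, takes two gradient evaluations per update: it first moves to an approximate worst-case point within a radius rho of the current weights, then updates the original weights using the gradient computed at that point. The sketch below shows one such step on a toy quadratic loss. The toy loss, the function names, and the values of `lr` and `rho` are assumptions for illustration and do not reflect the paper's training setup.

```python
import math


def loss(w, curvature):
    """Toy loss 0.5 * sum(c_i * w_i^2), used only to demonstrate the update."""
    return 0.5 * sum(c * wi * wi for c, wi in zip(curvature, w))


def grad(w, curvature):
    """Gradient of the toy loss with respect to w."""
    return [c * wi for c, wi in zip(curvature, w)]


def sam_step(w, curvature, lr=0.1, rho=0.05):
    """One SAM update: perturb along the normalized gradient, then descend."""
    g = grad(w, curvature)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    # Ascent step: approximate worst-case weights within an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to the original weights.
    g_adv = grad(w_adv, curvature)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

In a deep learning framework the same two-pass structure wraps an ordinary optimizer, at roughly twice the cost per step.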

Abstract:
This paper introduces a novel multimodal framework for hate speech detection in deepfake audio, excelling even in zero‑shot scenarios. Unlike previous approaches, our method uses contrastive learning to jointly align audio and text representations across languages. We present the first benchmark dataset with 127,290 paired text and synthesized speech samples in six languages: English and five low‑resource Indian languages (Hindi, Bengali, Marathi, Tamil, Telugu). Our model learns a shared semantic embedding space, enabling robust cross‑lingual and cross‑modal classification. Experiments on two multilingual test sets show our approach outperforms baselines, achieving accuracies of 0.819 and 0.701, and generalizes well to unseen languages. This demonstrates the advantage of combining modalities for hate speech detection in synthetic media, especially in low‑resource settings where unimodal models falter. The dataset is available at https://www.iab‑rubric.org/resources.

Abstract:
Deepfakes are AI‑synthesized multimedia data that may be abused for spreading misinformation. Deepfake generation involves both visual and audio manipulation. To detect audio‑visual deepfakes, previous studies commonly employ two relatively independent sub‑models to learn audio and visual features, respectively, and fuse them subsequently for deepfake detection. However, this may underutilize the inherent correlations between audio and visual features. Moreover, utilizing two isolated feature learning sub‑models can result in redundant neural layers, making the overall model inefficient and impractical for resource‑constrained environments. In this work, we design a lightweight network for audio‑visual deepfake detection via a single‑stream multi‑modal learning framework. Specifically, we introduce a collaborative audio‑visual learning block to efficiently integrate multi‑modal information while learning the visual and audio features. By iteratively employing this block, our single‑stream network achieves a continuous fusion of multi‑modal features across its layers. Thus, our network efficiently captures visual and audio features without the need for excessive block stacking, resulting in a lightweight network design. Furthermore, we propose a multi‑modal classification module that can boost the dependence of the visual and audio classifiers on modality content. It also enhances the overall resistance of the video classifier against mismatches between audio and visual modalities. We conduct experiments on the DF‑TIMIT, FakeAVCeleb, and DFDC benchmark datasets. Compared to state‑of‑the‑art audio‑visual joint detection methods, our method is significantly more lightweight, with only 0.48M parameters, yet it achieves superior performance on both uni‑modal and multi‑modal deepfakes, as well as on unseen types of deepfakes.

Abstract:
The rise of deepfake audio and hate speech, powered by advanced text‑to‑speech, threatens online safety. We present SynHate, the first multilingual dataset for detecting hate speech in synthetic audio, spanning 37 languages. SynHate uses a novel four‑class scheme: Real‑normal, Real‑hate, Fake‑normal, and Fake‑hate. Built from MuTox and ADIMA datasets, it captures diverse hate speech patterns globally and in India. We evaluate five leading self‑supervised models (Whisper‑small/medium, XLS‑R, AST, mHuBERT), finding notable performance differences by language, with Whisper‑small performing best overall. Cross‑dataset generalization remains a challenge. By releasing SynHate and baseline code, we aim to advance robust, culturally sensitive, and multilingual solutions against synthetic hate speech. The dataset is available at https://www.iab‑rubric.org/resources.

Abstract:
Evaluating explainability techniques, such as SHAP and LRP, in the context of audio deepfake detection is challenging due to the lack of clear ground truth annotations. In cases where we are able to obtain the ground truth, we find that these methods struggle to provide accurate explanations. In this work, we propose a novel data‑driven approach to identify artifact regions in deepfake audio. We consider paired real and vocoded audio, and use the difference in time‑frequency representation as the ground‑truth explanation. The difference signal then serves as supervision to train a diffusion model to expose the deepfake artifacts in a given vocoded audio. Experimental results on the VocV4 and LibriSeVoc datasets demonstrate that our method outperforms traditional explainability techniques, both qualitatively and quantitatively.

Abstract:
Audio deepfakes are acquiring an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system combining a deep metric multi‑class N‑pair loss with a Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score‑embedding fusion. The N‑pair loss improves discriminative ability, while Real Emphasis and Fake Dispersion enhance robustness by focusing on differentiating real and fake speech patterns. The Conformer network captures both global and local dependencies in the audio signal, crucial for source tracing. The proposed ensemble score‑embedding fusion shows an optimal trade‑off between in‑domain and out‑of‑domain source tracing scenarios. We evaluate our method using Fréchet Distance and standard metrics, demonstrating superior performance in source tracing over the baseline system.
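
The multi‑class N‑pair loss mentioned above has a compact closed form for a single anchor: L = log(1 + sum_i exp(a·n_i − a·p)), where a is the anchor embedding, p is an embedding from the same class, and n_i are embeddings from the other classes. The sketch below implements only this standard loss. The Real Emphasis and Fake Dispersion framework is not reproduced, since the abstract does not define it, and the function names are assumptions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def n_pair_loss(anchor, positive, negatives):
    """Multi-class N-pair loss for one anchor, one positive, and several negatives."""
    pos_sim = dot(anchor, positive)
    # Each negative contributes more when it is closer to the anchor than the positive.
    return math.log1p(sum(math.exp(dot(anchor, n) - pos_sim) for n in negatives))
```

The loss approaches zero when the anchor is much more similar to its positive than to every negative, which is what pulls utterances from the same source system together in the embedding space.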

Abstract:
Existing methods for deepfake audio detection have demonstrated some effectiveness. However, they still face challenges in generalizing to new forgery techniques and evolving attack patterns. This limitation mainly arises because the models rely heavily on the distribution of the training data and fail to learn a decision boundary that captures the essential characteristics of forgeries. Additionally, relying solely on a classification loss makes it difficult to capture the intrinsic differences between real and fake audio. In this paper, we propose RPRA‑ADD, a robust audio deepfake detection framework driven by forgery trace enhancement and built on integrated Reconstruction‑Perception‑Reinforcement‑Attention networks. First, we propose a Global‑Local Forgery Perception (GLFP) module for enhancing the acoustic perception capacity of forgery traces. To significantly reinforce the feature space distribution differences between real and fake audio, the Multi‑stage Dispersed Enhancement Loss (MDEL) is designed, which implements a dispersal strategy in multi‑stage feature spaces. Furthermore, in order to enhance feature awareness towards forgery traces, the Fake Trace Focused Attention (FTFA) mechanism is introduced to adjust attention weights dynamically according to the reconstruction discrepancy matrix. Visualization experiments demonstrate that FTFA not only improves attention to voice segments but also enhances the generalization capability. Experimental results demonstrate that the proposed method achieves state‑of‑the‑art performance on 4 benchmark datasets, including ASVspoof2019, ASVspoof2021, CodecFake, and FakeSound, achieving over 20% performance improvement. In addition, it outperforms existing methods in 33 rigorous cross‑domain evaluations across Speech, Sound, and Singing, demonstrating strong generalization capability across diverse audio domains.

Abstract:
Recent advancements in Text‑to‑Speech (TTS) models, particularly in voice cloning, have intensified the demand for adaptable and efficient deepfake detection methods. As TTS systems continue to evolve, detection models must be able to efficiently adapt to previously unseen generation models with minimal data. This paper introduces ADD‑GP, a few‑shot adaptive framework based on a Gaussian Process (GP) classifier for Audio Deepfake Detection (ADD). We show how combining a powerful deep embedding model with the flexibility of Gaussian processes can achieve strong performance and adaptability. Additionally, we show this approach can also be used for personalized detection, with greater robustness to new TTS models and one‑shot adaptability. To support our evaluation, a benchmark dataset is constructed for this task using new state‑of‑the‑art voice cloning models.

Abstract:
A key research area in deepfake speech detection is source tracing ‑ determining the origin of synthesised utterances. The approaches may involve identifying the acoustic model (AM), vocoder model (VM), or other generation‑specific parameters. However, progress is limited by the lack of a dedicated, systematically curated dataset. To address this, we introduce STOPA, a systematically varied and metadata‑rich dataset for deepfake speech source tracing, covering 8 AMs, 6 VMs, and diverse parameter settings across 700k samples from 13 distinct synthesisers. Unlike existing datasets, which often feature limited variation or sparse metadata, STOPA provides a systematically controlled framework covering a broader range of generative factors, such as the choice of the vocoder model, acoustic model, or pretrained weights, ensuring higher attribution reliability. This control improves attribution accuracy, aiding forensic analysis, deepfake detection, and generative model transparency.

Abstract:
We show how replay attacks undermine audio deepfake detection: By playing and re‑recording deepfake audio through various speakers and microphones, we make spoofed samples appear authentic to the detection model. To study this phenomenon in more detail, we introduce ReplayDF, a dataset of recordings derived from M‑AILABS and MLAAD, featuring 109 speaker‑microphone combinations across six languages and four TTS models. It includes diverse acoustic conditions, some highly challenging for detection. Our analysis of six open‑source detection models across five datasets reveals significant vulnerability, with the top‑performing W2V2‑AASIST model's Equal Error Rate (EER) surging from 4.7% to 18.2%. Even with adaptive Room Impulse Response (RIR) retraining, performance remains compromised with an 11.0% EER. We release ReplayDF for non‑commercial research use.
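
Since the abstract above reports its results as Equal Error Rates, a minimal sketch of how an EER can be approximated from detector scores may be useful. It is not the authors' evaluation code: the function name `equal_error_rate`, the score convention (higher means more likely bona fide), and the simple threshold sweep are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Assumed convention: a higher score means "more likely bona fide".
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is reported where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Under this convention, a replay attack that pushes spoof scores upward toward the bona fide range raises the false acceptance rate at every threshold, which is what an increase in EER reflects.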

Abstract:
With the proliferation of speech deepfake generators, it becomes crucial not only to assess the authenticity of synthetic audio but also to trace its origin. While source attribution models attempt to address this challenge, they often struggle in open‑set conditions against unseen generators. In this paper, we introduce the source verification task, which, inspired by speaker verification, determines whether a test track was produced using the same model as a set of reference signals. Our approach leverages embeddings from a classifier trained for source attribution, computing distance scores between tracks to assess whether they originate from the same source. We evaluate multiple models across diverse scenarios, analyzing the impact of speaker diversity, language mismatch, and post‑processing operations. This work provides the first exploration of source verification, highlighting its potential and vulnerabilities, and offers insights for real‑world forensic applications.
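
The distance-based decision described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the embeddings are taken as given lists of floats, cosine similarity against the averaged reference embedding is one possible scoring choice, and the names `cosine_similarity`, `same_source` and the threshold value 0.7 are hypothetical rather than taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_source(test_embedding, reference_embeddings, threshold=0.7):
    """Score a test track against reference tracks from one generator.

    The threshold is a placeholder; in practice it would be tuned on
    held-out trials, as in speaker verification.
    """
    count = len(reference_embeddings)
    # Average the reference embeddings into a single centroid.
    centroid = [
        sum(ref[i] for ref in reference_embeddings) / count
        for i in range(len(test_embedding))
    ]
    score = cosine_similarity(test_embedding, centroid)
    return score, score >= threshold
```

Because the decision only compares embeddings, the reference set can in principle come from a generator that was never seen while training the attribution classifier, which is the open-set appeal of the verification formulation.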

Abstract:
Recent advances in speech deepfake detection (SDD) have significantly improved artifact‑based detection in spoofed speech. However, most models overlook speech naturalness, a crucial cue for distinguishing bona fide speech from spoofed speech. This study proposes naturalness‑aware curriculum learning, a novel training framework that leverages speech naturalness to enhance the robustness and generalization of SDD. This approach measures sample difficulty using both ground‑truth labels and mean opinion scores, and adjusts the training schedule to progressively introduce more challenging samples. To further improve generalization, a dynamic temperature scaling method based on speech naturalness is incorporated into the training process. A 23% relative reduction in the EER was achieved in the experiments on the ASVspoof 2021 DF dataset, without modifying the model architecture. Ablation studies confirmed the effectiveness of naturalness‑aware training strategies for SDD tasks.

Abstract:
We propose BiCrossMamba‑ST, a robust framework for speech deepfake detection that leverages a dual‑branch spectro‑temporal architecture powered by bidirectional Mamba blocks and mutual cross‑attention. By processing spectral sub‑bands and temporal intervals separately and then integrating their representations, BiCrossMamba‑ST effectively captures the subtle cues of synthetic speech. In addition, our proposed framework leverages a convolution‑based 2D attention map to focus on specific spectro‑temporal regions, enabling robust deepfake detection. Operating directly on raw features, BiCrossMamba‑ST achieves significant performance improvements, a 67.74% and 26.3% relative gain over state‑of‑the‑art AASIST on ASVSpoof LA21 and ASVSpoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVSpoof DF21. Code and models will be made publicly available.

Abstract:
This study explores the potential of using acoustic features of segmental speech sounds to detect deepfake audio. These features are highly interpretable because of their close relationship with human articulatory processes and are expected to be more difficult for deepfake models to replicate. The results demonstrate that certain segmental features commonly used in forensic voice comparison (FVC) are effective in identifying deepfakes, whereas some global features provide little value. These findings underscore the need to approach audio deepfake detection using methods that are distinct from those employed in traditional FVC, and offer a new perspective on leveraging segmental features for this purpose. In addition, the present study proposes a speaker‑specific framework for deepfake detection, which differs fundamentally from the speaker‑independent systems that dominate current benchmarks. While speaker‑independent frameworks aim at broad generalization, the speaker‑specific approach offers advantages in forensic contexts where case‑by‑case interpretability and sensitivity to individual phonetic realization are essential.

Abstract:
Recent advances in neural audio codec‑based speech generation (CoSG) models have produced remarkably realistic audio deepfakes. We refer to deepfake speech generated by CoSG systems as codec‑based deepfake, or CodecFake. Although existing anti‑spoofing research on CodecFake predominantly focuses on verifying the authenticity of audio samples, almost no attention has been given to tracing the CoSG used in generating these deepfakes. In CodecFake generation, processes such as speech‑to‑unit encoding, discrete unit modeling, and unit‑to‑speech decoding are fundamentally based on neural audio codecs. Motivated by this, we introduce source tracing for CodecFake via neural audio codec taxonomy, which dissects neural audio codecs to trace CoSG. Our experimental results on the CodecFake+ dataset provide promising initial evidence for the feasibility of CodecFake source tracing while also highlighting several challenges that warrant further investigation.

Abstract:
Deepfake audio detection is challenging for low‑resource languages like Bengali due to limited datasets and subtle acoustic features. To address this, we introduce BangalFake, a Bengali Deepfake Audio Dataset with 12,260 real and 13,260 deepfake utterances. Synthetic speech is generated using SOTA Text‑to‑Speech (TTS) models, ensuring high naturalness and quality. We evaluate the dataset through both qualitative and quantitative analyses. Mean Opinion Score (MOS) from 30 native speakers shows Robust‑MOS of 3.40 (naturalness) and 4.01 (intelligibility). t‑SNE visualization of MFCCs highlights real vs. fake differentiation challenges. This dataset serves as a crucial resource for advancing deepfake detection in Bengali, addressing the limitations of low‑resource language research.

Abstract:
The rapid evolution of generative AI has increased the threat of realistic audio‑visual deepfakes, demanding robust detection methods. Existing solutions primarily address unimodal (audio or visual) forgeries but struggle with multimodal manipulations due to inadequate handling of heterogeneous modality features and poor generalization across datasets. To this end, we propose a novel framework called FauForensics by introducing biologically invariant facial action units (FAUs), which are quantitative descriptors of facial muscle activity linked to emotion physiology. They serve as forgery‑resistant representations that reduce domain dependency while capturing subtle dynamics often disrupted in synthetic content. Besides, instead of comparing entire video clips as in prior works, our method computes fine‑grained frame‑wise audiovisual similarities via a dedicated fusion module augmented with learnable cross‑modal queries. It dynamically aligns temporal‑spatial lip‑audio relationships while mitigating multi‑modal feature heterogeneity issues. Experiments on FakeAVCeleb and LAV‑DF show state‑of‑the‑art (SOTA) performance and superior cross‑dataset generalizability, with an average improvement of up to 4.83% over existing methods.

Abstract:
Deepfake audio presents a growing threat to digital security, due to its potential for social engineering, fraud, and identity misuse. However, existing detection models suffer from poor generalization across datasets, due to implicit identity leakage, where models inadvertently learn speaker‑specific features instead of manipulation artifacts. To the best of our knowledge, this is the first study to explicitly analyze and address identity leakage in the audio deepfake detection domain. This work proposes an identity‑independent audio deepfake detection framework that mitigates identity leakage by encouraging the model to focus on forgery‑specific artifacts instead of overfitting to speaker traits. Our approach leverages Artifact Detection Modules (ADMs) to isolate synthetic artifacts in both time and frequency domains, enhancing cross‑dataset generalization. We introduce novel dynamic artifact generation techniques, including frequency domain swaps, time domain manipulations, and background noise augmentation, to enforce learning of dataset‑invariant features. Extensive experiments conducted on ASVspoof2019, ADD 2022, FoR, and In‑The‑Wild datasets demonstrate that the proposed ADM‑enhanced models achieve F1 scores of 0.230 (ADD 2022), 0.604 (FoR), and 0.813 (In‑The‑Wild), consistently outperforming the baseline. Dynamic Frequency Swap proves to be the most effective strategy across diverse conditions. These findings emphasize the value of artifact‑based learning in mitigating implicit identity leakage for more generalizable audio deepfake detection.

Abstract:
Audio deepfakes represent a growing threat to digital security and trust, leveraging advanced generative models to produce synthetic speech that closely mimics real human voices. Detecting such manipulations is especially challenging under open‑world conditions, where spoofing methods encountered during testing may differ from those seen during training. In this work, we propose an end‑to‑end deep learning framework for audio deepfake detection that operates directly on raw waveforms. Our model, RawNetLite, is a lightweight convolutional‑recurrent architecture designed to capture both spectral and temporal features without handcrafted preprocessing. To enhance robustness, we introduce a training strategy that combines data from multiple domains and adopts Focal Loss to emphasize difficult or ambiguous samples. We further demonstrate that incorporating codec‑based manipulations and applying waveform‑level audio augmentations (e.g., pitch shifting, noise, and time stretching) leads to significant generalization improvements under realistic acoustic conditions. The proposed model achieves over 99.7% F1 and 0.25% EER on in‑domain data (FakeOrReal), and up to 83.4% F1 with 16.4% EER on a challenging out‑of‑distribution test set (AVSpoof2021 + CodecFake). These findings highlight the importance of diverse training data, tailored objective functions and audio augmentations in building resilient and generalizable audio forgery detectors. Code and pretrained models are available at https://iplab.dmi.unict.it/mfs/Deepfakes/PaperRawNet2025/.
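
The abstract above names Focal Loss as the objective used to emphasize difficult samples. A minimal per-sample sketch of the standard binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), is given below. The function name `binary_focal_loss`, the label convention (1 = fake, 0 = real) and the defaults gamma = 2.0 and alpha = 0.25 are illustrative assumptions; the abstract does not state the hyperparameters actually used.

```python
import math


def binary_focal_loss(p_fake, label, gamma=2.0, alpha=0.25):
    """Focal loss for one sample.

    p_fake: predicted probability that the sample is fake.
    label:  1 for fake, 0 for real (assumed convention).
    """
    # Probability the model assigned to the true class.
    p_t = p_fake if label == 1 else 1.0 - p_fake
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # Clamp to avoid log(0) on extremely confident mistakes.
    p_t = min(max(p_t, 1e-12), 1.0)
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to class-weighted cross-entropy, which is a convenient sanity check when experimenting with the loss.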

Abstract:
We introduce a real‑time, human‑in‑the‑loop gesture control framework that can dynamically adapt audio and music based on human movement by analyzing live video input. By creating a responsive connection between visual and auditory stimuli, this system enables dancers and performers to not only respond to music but also influence it through their movements. Designed for live performances, interactive installations, and personal use, it offers an immersive experience where users can shape the music in real time. The framework integrates computer vision and machine learning techniques to track and interpret motion, allowing users to manipulate audio elements such as tempo, pitch, effects, and playback sequence. With ongoing training, it achieves user‑independent functionality, requiring as few as 50 to 80 samples to label simple gestures. This framework combines gesture training, cue mapping, and audio manipulation to create a dynamic, interactive experience. Gestures are interpreted as input signals, mapped to sound control commands, and used to naturally adjust music elements, showcasing the seamless interplay between human interaction and machine response.

Abstract:
Generalizability, the capacity of a robust model to perform effectively on unseen data, is crucial for audio deepfake detection due to the rapid evolution of text‑to‑speech (TTS) and voice conversion (VC) technologies. A promising approach to differentiate between bonafide and spoof samples lies in identifying intrinsic disparities to enhance model generalizability. From an information‑theoretic perspective, we hypothesize that information content is one of the intrinsic differences: a bonafide sample represents a dense, information‑rich sampling of the real world, whereas a spoof sample is typically derived from lower‑dimensional, less informative representations. To implement this, we introduce the frame‑level latent information entropy detector (f‑InfoED), a framework that extracts distinctive information entropy from latent representations at the frame level to identify audio deepfakes. Furthermore, we present AdaLAM, which extends large pre‑trained audio models with trainable adapters for enhanced feature extraction. To facilitate comprehensive evaluation, the audio deepfake forensics 2024 (ADFF 2024) dataset was built using the latest TTS and VC methods. Extensive experiments demonstrate that our proposed approach achieves state‑of‑the‑art performance and exhibits remarkable generalization capabilities. Further analytical studies confirm the efficacy of AdaLAM in extracting discriminative audio features and of f‑InfoED in leveraging latent entropy information for more generalized deepfake detection.

Abstract:
The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single‑type audio deepfake detection (ADD), their performance declines in cross‑type scenarios. This paper is dedicated to studying the all‑type ADD task. We are the first to comprehensively establish an all‑type ADD benchmark to evaluate current CMs, incorporating cross‑type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self‑supervised learning (PT‑SSL) training paradigm, which optimizes the SSL front‑end by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine‑tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)‑SSL method to capture type‑invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all‑type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co‑training. Experimental results demonstrate that WPT‑XLSR‑AASIST achieved the best performance, with an average EER of 3.58% across all evaluation sets.

Abstract:
The rise of AI‑driven generative models has enabled the creation of highly realistic speech deepfakes ‑ synthetic audio signals that can imitate target speakers' voices ‑ raising critical security concerns. Existing methods for detecting speech deepfakes primarily rely on supervised learning, which suffers from two critical limitations: limited generalization to unseen synthesis techniques and a lack of explainability. In this paper, we address these issues by introducing a novel interpretable one‑class detection framework, which reframes speech deepfake detection as an anomaly detection task. Our model is trained exclusively on real speech to characterize its distribution, enabling the classification of out‑of‑distribution samples as synthetically generated. Additionally, our framework produces interpretable anomaly maps during inference, highlighting anomalous regions across both time and frequency domains. This is done through a Student‑Teacher Feature Pyramid Matching system, enhanced with Discrepancy Scaling to improve generalization capabilities across unseen data distributions. Extensive evaluations demonstrate the superior performance of our approach compared to the considered baselines, validating the effectiveness of framing speech deepfake detection as an anomaly detection problem.

Abstract:
Deepfakes have become a universal and rapidly intensifying concern of generative AI across various media types such as images, audio, and videos. Among these, audio deepfakes have been of particular concern due to the ease of high‑quality voice synthesis and distribution via platforms such as social media and robocalls. Consequently, detecting audio deepfakes plays a critical role in combating the growing misuse of AI‑synthesized speech. However, real‑world scenarios often introduce various audio corruptions, such as noise, modification, and compression, that may significantly impact detection performance. This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions, categorized into noise perturbation, audio modification, and compression. Using both traditional deep learning models and state‑of‑the‑art foundation models, we make four unique observations. First, our findings show that while most models demonstrate strong robustness to noise, they are notably more vulnerable to modifications and compression, especially when neural codecs are applied. Second, speech foundation models generally outperform traditional models across most scenarios, likely due to their self‑supervised learning paradigm and large‑scale pre‑training. Third, our results show that increasing model size improves robustness, albeit with diminishing returns. Fourth, we demonstrate how targeted data augmentation during training can enhance model resilience to unseen perturbations. A case study on political speech deepfakes highlights the effectiveness of foundation models in achieving high accuracy under real‑world conditions. These findings emphasize the importance of developing more robust detection frameworks to ensure reliability in practical deployment settings.

Abstract:
In this paper, we propose a deep neural network approach for deepfake speech detection (DSD) based on a low‑complexity Depthwise‑Inception Network (DIN) trained with a contrastive training strategy (CTS). In this framework, input audio recordings are first transformed into spectrograms using Short‑Time Fourier Transform (STFT) and Linear Filter (LF), which are then used to train the DIN. Once trained, the DIN processes bonafide utterances to extract audio embeddings, which are used to construct a Gaussian distribution representing genuine speech. Deepfake detection is then performed by computing the distance between a test utterance and this distribution to determine whether the utterance is fake or bonafide. To evaluate our proposed systems, we conducted extensive experiments on the ASVspoof 2019 LA benchmark dataset. The experimental results demonstrate the effectiveness of combining the Depthwise‑Inception Network with the contrastive learning strategy in distinguishing between fake and bonafide utterances. We achieved Equal Error Rate (EER), Accuracy (Acc.), F1, and AUC scores of 4.6%, 95.4%, 97.3%, and 98.9%, respectively, using a single, low‑complexity DIN with just 1.77 M parameters and 985 M FLOPS on short audio segments (4 seconds). Furthermore, our proposed system outperforms the single‑system submissions in the ASVspoof 2019 LA challenge, showcasing its potential for real‑time applications.
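
The scoring step described above (fit a Gaussian to bonafide embeddings, then measure how far a test utterance lies from it) can be sketched as follows. The abstract does not specify the covariance structure or the distance, so this sketch assumes a diagonal covariance and a Mahalanobis distance; the function names and the small variance floor are hypothetical.

```python
import math


def fit_diagonal_gaussian(embeddings):
    """Estimate per-dimension mean and variance from bonafide embeddings."""
    count, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / count for i in range(dim)]
    # A small floor keeps the variance strictly positive (assumption).
    var = [
        sum((e[i] - mean[i]) ** 2 for e in embeddings) / count + 1e-6
        for i in range(dim)
    ]
    return mean, var


def mahalanobis_distance(x, mean, var):
    """Distance of x from the Gaussian, assuming a diagonal covariance."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))
```

A test utterance would then be flagged as fake when its distance exceeds a threshold chosen on development data; larger distances mean the embedding is less typical of genuine speech.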

Abstract:
Audio deepfakes are increasingly indistinguishable from organic speech, often fooling both authentication systems and human listeners. While many techniques use low‑level audio features or optimization‑based black‑box model training, focusing on the features that humans use to recognize speech will likely be a more robust long‑term approach to detection. We explore the use of prosody, or the high‑level linguistic features of human speech (e.g., pitch, intonation, jitter), as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes, with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of using a linguistic features‑based approach over existing models by applying an adaptive adversary using an L_\infty norm attack against the detectors and using attention mechanisms in our training for explainability. We show that we can explain the prosodic features that have the highest impact on the model's decision (Jitter, Shimmer and Mean Fundamental Frequency) and that other models are extremely susceptible to simple L_\infty norm attacks (99.3% relative degradation in accuracy). While overall performance may be similar, we illustrate the robustness and explainability benefits of a prosody‑feature approach to audio deepfake detection.
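
Two of the prosodic features named above, jitter and shimmer, have simple textbook definitions that can be sketched in a few lines. The sketch assumes the commonly used "local" variants (mean absolute difference between consecutive cycles, divided by the mean value) and that glottal cycle periods and peak amplitudes have already been extracted by some pitch tracker; it is not the feature pipeline from the paper, and the function names are hypothetical.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values over the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods_seconds):
    """Cycle-to-cycle variation of the glottal period."""
    return local_perturbation(periods_seconds)


def local_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude."""
    return local_perturbation(peak_amplitudes)
```

Both measures are dimensionless ratios and are zero for a perfectly periodic voice.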

Abstract:
Reliable detection of speech deepfakes (spoofs) must remain effective when the distribution of spoofing attacks shifts. We frame the task as domain generalization and show that inserting Low‑Rank Adaptation (LoRA) adapters into every attention head of a self‑supervised (SSL) backbone, then training only those adapters with Meta‑Learning Domain Generalization (MLDG), yields strong zero‑shot performance. The resulting model updates about 3.6 million parameters, roughly 1.1% of the 318 million updated in full fine‑tuning, yet surpasses a fully fine‑tuned counterpart on five of six evaluation corpora. A first‑order MLDG loop encourages the adapters to focus on cues that persist across attack types, lowering the average EER from 8.84% for the fully fine‑tuned model to 5.30% with our best MLDG‑LoRA configuration. Our findings show that combining meta‑learning with parameter‑efficient adaptation offers an effective method for zero‑shot, distribution‑shift‑aware speech deepfake detection.
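
The parameter budget quoted above can be checked with a short calculation. For a weight matrix of shape d_out x d_in, a rank-r LoRA adapter adds two low-rank factors with r * (d_in + d_out) trainable parameters in total. The layer dimensions and rank used in the usage note are hypothetical; only the 3.6 million and 318 million totals come from the abstract.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter (factors A and B)."""
    return rank * (d_in + d_out)


def trainable_fraction(adapter_params, full_params):
    """Share of parameters updated relative to full fine-tuning."""
    return adapter_params / full_params


# Totals reported in the abstract: about 3.6M adapter vs 318M full parameters.
fraction = trainable_fraction(3.6e6, 318e6)
```

As a usage example, `lora_param_count(1024, 1024, 8)` gives 16,384 parameters for a single hypothetical 1024 x 1024 projection at rank 8, and `fraction` evaluates to roughly 0.0113, consistent with the "roughly 1.1%" stated in the abstract.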

Abstract:
Voice Authentication (VA), also known as Automatic Speaker Verification (ASV), is a widely adopted authentication method, particularly in automated systems like banking services, where it serves as a secondary layer of user authentication. Despite its popularity, VA systems are vulnerable to various attacks, including replay, impersonation, and the emerging threat of deepfake audio that mimics the voice of legitimate users. To mitigate these risks, several defense mechanisms have been proposed. One such solution, Voice Pops, aims to distinguish an individual's unique phoneme pronunciations during the enrollment process. While promising, the effectiveness of VA+VoicePop against a broader range of attacks, particularly logical or adversarial attacks, remains insufficiently explored. We propose a novel attack method, which we refer to as SyntheticPop, designed to target the phoneme recognition capabilities of the VA+VoicePop system. The SyntheticPop attack involves embedding synthetic "pop" noises into spoofed audio samples, significantly degrading the model's performance. We achieve an attack success rate of over 95% while poisoning 20% of the training dataset. Our experiments demonstrate that VA+VoicePop achieves 69% accuracy under normal conditions, 37% accuracy when subjected to a baseline label flipping attack, and just 14% accuracy under our proposed SyntheticPop attack, emphasizing the effectiveness of our method.

Abstract:
The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network‑based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize both robust detection and accurate attribution. This paper introduces Cross‑Attention Robust Audio Watermark (XATTNMARK), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross‑attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic‑aligned time‑frequency (TF) masking loss that captures fine‑grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state‑of‑the‑art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.

Abstract:
Audio deepfakes pose significant threats, including impersonation, fraud, and reputation damage. To address these risks, audio deepfake detection (ADD) techniques have been developed, demonstrating success on benchmarks like ASVspoof2019. However, their resilience against transferable adversarial attacks remains largely unexplored. In this paper, we introduce a transferable GAN‑based adversarial attack framework to evaluate the effectiveness of state‑of‑the‑art (SOTA) ADD systems. By leveraging an ensemble of surrogate ADD models and a discriminator, the proposed approach generates transferable adversarial attacks that better reflect real‑world scenarios. Unlike previous methods, the proposed framework incorporates a self‑supervised audio model to ensure transcription and perceptual integrity, resulting in high‑quality adversarial attacks. Experimental results on a benchmark dataset reveal that SOTA ADD systems exhibit significant vulnerabilities, with accuracies dropping from 98% to 26%, 92% to 54%, and 94% to 84% in white‑box, gray‑box, and black‑box scenarios, respectively. When tested on other datasets, performance dropped from 91% to 46% and from 94% to 67% on the In‑the‑Wild and WaveFake datasets, respectively. These results highlight the significant vulnerabilities of existing ADD systems and emphasize the need to enhance their robustness against advanced adversarial threats to ensure security and reliability.

Abstract:
In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. In particular, the ability to create credible minute‑long synthetic music in a few seconds on user‑friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and artificial reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of an AI‑music detector, a tool that will help in the regulation of synthetic media. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that getting a good test score is not the end of the story. We expose and discuss several facets that could be problematic with such a deployed detector, including robustness to audio manipulation and generalisation to unseen models. This second part acts as a position for future research steps in the field and a caveat to a flourishing market of artificial content checkers.

Abstract:
This paper proposes an audio‑visual deepfake detection approach that aims to capture fine‑grained temporal inconsistencies between audio and visual modalities. To achieve this, both architectural and data synthesis strategies are introduced. From an architectural perspective, a temporal distance map, coupled with an attention mechanism, is designed to capture these inconsistencies while minimizing the impact of irrelevant temporal subsequences. Moreover, we explore novel pseudo‑fake generation techniques to synthesize local inconsistencies. Our approach is evaluated against state‑of‑the‑art methods using the DFDC and FakeAVCeleb datasets, demonstrating its effectiveness in detecting audio‑visual deepfakes.

Abstract:
Audio deepfake detection is increasingly important as synthetic speech becomes more realistic and accessible. Recent methods, including those using graph neural networks (GNNs) to model frequency and temporal dependencies, show strong potential but need large amounts of labeled data, which limits their practical use. Label‑efficient alternatives like graph‑based non‑contrastive learning offer a potential solution, as they can learn useful representations from unlabeled data without using negative samples. However, current graph non‑contrastive approaches are built for single‑view graph representations and cannot be directly used for audio, which has unique spectral and temporal structures. Bridging this gap requires dual‑view graph modeling suited to audio signals. In this work, we introduce SIGNL (Spectral‑temporal vIsion Graph Non‑contrastive Learning), a label‑efficient expert system for detecting audio deepfakes. SIGNL operates on the visual representation of audio, such as spectrograms or other time‑frequency encodings, transforming them into spectral and temporal graphs for structured feature extraction. It then employs graph convolutional encoders to learn complementary frequency‑time features, effectively capturing the unique characteristics of audio. These encoders are pre‑trained using a non‑contrastive self‑supervised learning strategy on augmented graph pairs, enabling effective representation learning without labeled data. The resulting encoders are then fine‑tuned on minimal labelled data for downstream deepfake detection. SIGNL achieves strong performance on multiple audio deepfake detection benchmarks, including 7.88% EER on ASVspoof 2021 DF and 3.95% EER on ASVspoof 5 using only 5% labeled data. It also generalizes well to unseen conditions, reaching 10.16% EER on the In‑The‑Wild dataset when trained on CFAD.

Abstract:
Since the majority of audio DeepFake (DF) detection methods are trained on English‑centric datasets, their applicability to non‑English languages remains largely unexplored. In this work, we present a benchmark for the multilingual audio DF detection challenge by evaluating various adaptation strategies. Our experiments focus on analyzing models trained on English benchmark datasets, as well as intra‑linguistic (same‑language) and cross‑linguistic adaptation approaches. Our results indicate considerable variations in detection efficacy, highlighting the difficulties of multilingual settings. We show that limiting the dataset to English negatively impacts the efficacy, while stressing the importance of the data in the target language.

Abstract:
Open‑set model attribution of deepfake audio in open environments is an emerging research topic that aims to identify the generation models of deepfake audio. Most previous work requires manually setting a rejection threshold for unknown classes, against which predicted probabilities are compared. However, models often overfit training instances and generate overly confident predictions. Moreover, thresholds that effectively distinguish unknown categories in the current dataset may not be suitable for identifying known and unknown categories in another data distribution. To address these issues, we propose a novel framework for open‑set model attribution of deepfake audio with rejection threshold adaptation (ReTA). Specifically, the reconstruction error learning module is trained by combining the representation of system fingerprints with labels corresponding to either the target class or a randomly chosen other class. This process generates matching and non‑matching reconstructed samples, establishing the reconstruction error distributions for each class and laying the foundation for the reject threshold calculation module. The reject threshold calculation module uses Gaussian probability estimation to fit the distributions of matching and non‑matching reconstruction errors. It then computes adaptive reject thresholds for all classes through probability minimization criteria. The experimental results demonstrate the effectiveness of ReTA in improving open‑set model attribution of deepfake audio.
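As a rough illustration of the reject threshold calculation described above, the following sketch fits one Gaussian to the matching reconstruction errors and one to the non‑matching errors of a single class, then scans for the threshold that minimizes the summed probability of the two kinds of mistake. The function name, the scan grid, the toy error values, and the specific cost being minimized are assumptions made for illustration; the abstract does not spell out the exact criterion.

```python
from statistics import NormalDist, mean, stdev


def adaptive_reject_threshold(match_errors, nonmatch_errors, steps=1000):
    """Hypothetical helper: pick a per-class reject threshold.

    Assumes matching samples tend to have lower reconstruction error
    than non-matching ones, and models each group with a Gaussian.
    """
    match = NormalDist(mean(match_errors), stdev(match_errors))
    nonmatch = NormalDist(mean(nonmatch_errors), stdev(nonmatch_errors))

    low, high = match.mean, nonmatch.mean
    best_threshold, best_cost = low, float("inf")
    for i in range(steps + 1):
        t = low + (high - low) * i / steps
        # P(matching sample rejected) + P(non-matching sample accepted)
        cost = (1.0 - match.cdf(t)) + nonmatch.cdf(t)
        if cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold


# Toy reconstruction errors for one class (illustrative values only).
match_errors = [0.10, 0.12, 0.09, 0.11, 0.13]
nonmatch_errors = [0.40, 0.45, 0.38, 0.42, 0.50]
threshold = adaptive_reject_threshold(match_errors, nonmatch_errors)
print(f"reject samples whose reconstruction error exceeds {threshold:.3f}")
```

Because the threshold is derived from the fitted distributions of each class, it moves with the data instead of being fixed by hand, which is the property the abstract emphasizes.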

Abstract:
Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety‑critical applications such as deepfake detection, the focus of this paper. Here we reveal that two of the most widely used audio‑video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone we can separate the real and fake samples almost perfectly. As such, previous audio‑only and audio‑video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto such unwanted artifacts, and possibly other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self‑supervised audio‑video representations we remove the risk of relying on dataset‑specific biases and improve robustness in deepfake detection.
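To make the leading‑silence shortcut concrete, the sketch below measures how long a waveform stays below an amplitude threshold at its start, and trims that span. The function names, the threshold value, and the toy waveforms are assumptions for illustration only; the abstract does not state how silence was measured or removed.

```python
def leading_silence_seconds(samples, sample_rate, amplitude_threshold=1e-3):
    """Duration of the initial run of samples whose magnitude stays at or
    below the threshold (hypothetical helper, assumed threshold)."""
    for index, value in enumerate(samples):
        if abs(value) > amplitude_threshold:
            return index / sample_rate
    return len(samples) / sample_rate


def trim_leading_silence(samples, amplitude_threshold=1e-3):
    """Drop the initial silent run so a detector cannot rely on it."""
    for index, value in enumerate(samples):
        if abs(value) > amplitude_threshold:
            return samples[index:]
    return []


# Toy waveforms at 16 kHz: the "fake" clip starts with 160 silent samples.
fake_clip = [0.0] * 160 + [0.5, -0.4, 0.3]
real_clip = [0.2, -0.1, 0.3]
print(leading_silence_seconds(fake_clip, 16000))
print(leading_silence_seconds(real_clip, 16000))
```

A single scalar like this being enough to separate classes is a sign that a dataset leaks its labels; trimming it before training or evaluation removes that particular cue.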

Abstract:
Voice authentication on IoT‑enabled smart devices has gained prominence in recent years due to increasing concerns over user privacy and security. Current authentication systems are vulnerable to different voice‑spoofing attacks (e.g., replay, voice cloning, and audio deepfakes) that mimic legitimate voices to deceive authentication systems and enable fraudulent activities (e.g., impersonation, unauthorized access, and financial fraud). Existing solutions are often designed to tackle a single type of attack, leading to compromised performance against unseen attacks. On the other hand, existing unified voice anti‑spoofing solutions, not designed specifically for IoT, possess complex architectures and thus cannot be deployed on IoT‑enabled smart devices. Additionally, most of these unified solutions exhibit significant performance issues, including higher equal error rates or lower accuracy for specific attacks. To overcome these issues, we present the parallel stacked aggregation network (PSA‑Net), a lightweight framework designed as an anti‑spoofing defense system for voice‑controlled smart IoT devices. PSA‑Net processes raw audio directly and eliminates the need for dataset‑dependent handcrafted features or pre‑computed spectrograms. Furthermore, PSA‑Net employs a split‑transform‑aggregate approach, which involves segmenting utterances, extracting intrinsic differentiable embeddings through convolutions, and aggregating them to distinguish legitimate from spoofed audio. In contrast to existing deep ResNet‑oriented solutions, we incorporate cardinality as an additional dimension in our network, which enhances PSA‑Net's ability to generalize across diverse attacks. The results show that PSA‑Net achieves more consistent performance across different attacks than current anti‑spoofing solutions.

Abstract:
Recent techniques for speech deepfake detection often rely on pre‑trained self‑supervised models. These systems, initially developed for Automatic Speech Recognition (ASR), have proved their ability to offer a meaningful representation of speech signals, which can benefit various tasks, including deepfake detection. In this context, pre‑trained models serve as feature extractors and are used to extract embeddings from input speech, which are then fed to a binary speech deepfake detector. The remarkable accuracy achieved through this approach underscores a potential relationship between ASR and speech deepfake detection. However, this connection is not yet entirely clear, and we do not know whether improved performance in ASR corresponds to higher speech deepfake detection capabilities. In this paper, we address this question through a systematic analysis. We consider two different pre‑trained self‑supervised ASR models, Whisper and Wav2Vec 2.0, and adapt them for the speech deepfake detection task. These models have been released in multiple versions, with an increasing number of parameters and enhanced ASR performance. We investigate whether performance improvements in ASR correlate with improvements in speech deepfake detection. Our results provide insights into the relationship between these two tasks and offer valuable guidance for the development of more effective speech deepfake detectors.

Abstract:
This paper evaluates the impact of training undergraduate students to improve their audio deepfake discernment ability by listening for expert‑defined linguistic features. Such features have been shown to improve the performance of AI algorithms; here, we ascertain whether this improvement also translates to improved perceptual awareness and discernment ability in listeners. With humans as the weakest link in any cybersecurity solution, we propose that listener discernment is a key factor for improving the trustworthiness of audio content. In this study we determine whether training that familiarizes listeners with English language variation can improve their ability to discern audio deepfakes. We focus on undergraduate students, as this demographic group is constantly exposed to social media and the potential for deception and misinformation online. To the best of our knowledge, our work is the first study to address English audio deepfake discernment through such techniques. Our research goes beyond informational training by introducing targeted linguistic cues to listeners as a deepfake discernment mechanism, via a training module. In a pre/post experimental design, we evaluated the impact of the training across 264 students as a representative cross section of all students at the University of Maryland, Baltimore County, and across experimental and control sections. Findings show that the experimental group showed a statistically significant decrease in their uncertainty when evaluating audio clips and an improvement in their ability to correctly identify clips they were initially unsure about. While results are promising, future research will explore more robust and comprehensive training for greater impact.

Abstract:
Recent advances in AI‑generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox‑HQ, comprising 1.3 million samples, including 270,000 high‑quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high‑frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose the F‑SAT: Frequency‑Selective Adversarial Training method focusing on high‑frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state‑of‑the‑art RawNet3 model.

Abstract:
Current speech deepfake detection approaches perform satisfactorily against known adversaries; however, generalization to unseen attacks remains an open challenge. The proliferation of speech deepfakes on social media underscores the need for systems that can generalize to attacks not observed during training. We address this problem from the perspective of meta‑learning, aiming to learn attack‑invariant features that adapt to unseen attacks with very few samples available. This approach is promising because generating a large‑scale training dataset is often expensive or infeasible. Our experiments demonstrate an improvement in the Equal Error Rate (EER) from 21.67% to 10.42% on the InTheWild dataset, using just 96 samples from the unseen dataset. Continuous few‑shot adaptation ensures that the system remains up to date.

Abstract:
Spoofed audio, i.e., manipulated audio or AI‑generated deepfake audio, is difficult to detect when only using acoustic features. Recent work on AI‑spoofed audio detection models augmented with phonetic and phonological features of spoken English, manually annotated by experts, led to improved model performance. While this augmented model produced substantial improvements over models based on traditional acoustic features, a scalability challenge motivates inquiry into auto‑labeling of features. In this paper we propose an AI framework, Audio‑Linguistic Data Augmentation for Spoofed audio detection (ALDAS), for auto‑labeling linguistic features. ALDAS is trained on linguistic features selected and extracted by sociolinguistics experts; these auto‑labeled features are used to evaluate the quality of ALDAS predictions. Findings indicate that while the detection enhancement is not as substantial as with the pure ground‑truth linguistic features, performance still improves while achieving auto‑labeling. Labels generated by ALDAS are also validated by the sociolinguistics experts.

Abstract:
The rise of deepfake technologies has posed significant challenges to privacy, security, and information integrity, particularly in audio and multimedia content. This paper introduces a Quantum‑Trained Convolutional Neural Network (QT‑CNN) framework designed to enhance the detection of deepfake audio, leveraging the computational power of quantum machine learning (QML). The QT‑CNN employs a hybrid quantum‑classical approach, integrating Quantum Neural Networks (QNNs) with classical neural architectures to optimize training efficiency while reducing the number of trainable parameters. Our method incorporates a novel quantum‑to‑classical parameter mapping that effectively utilizes quantum states to enhance the expressive power of the model, achieving up to 70% parameter reduction compared to classical models without compromising accuracy. Data pre‑processing involved extracting essential audio features, label encoding, feature scaling, and constructing sequential datasets for robust model evaluation. Experimental results demonstrate that the QT‑CNN achieves comparable performance to traditional CNNs, maintaining high accuracy during training and testing phases across varying configurations of QNN blocks. The QT framework's ability to reduce computational overhead while maintaining performance underscores its potential for real‑world applications in deepfake detection and other resource‑constrained scenarios. This work highlights the practical benefits of integrating quantum computing into artificial intelligence, offering a scalable and efficient approach to advancing deepfake detection technologies.

Abstract:
The rapid proliferation of AI‑manipulated or generated audio deepfakes poses serious challenges to media integrity and election security. Current AI‑driven detection solutions lack explainability and underperform in real‑world settings. In this paper, we introduce novel explainability methods for state‑of‑the‑art transformer‑based audio deepfake detectors and open‑source a novel benchmark for real‑world generalizability. By narrowing the explainability gap between transformer‑based audio deepfake detectors and traditional methods, our results not only build trust with human experts, but also pave the way for unlocking the potential of citizen intelligence to overcome the scalability issue in audio deepfake detection.

Abstract:
Deepfakes pose a critical threat to biometric authentication systems by generating highly realistic synthetic media. Existing multimodal deepfake detectors often struggle to adapt to diverse data and rely on simple fusion methods. To address these challenges, we propose Gumbel‑Rao Monte Carlo Bi‑modal Neural Architecture Search (GRMC‑BMNAS), a novel architecture search framework that employs Gumbel‑Rao Monte Carlo sampling to optimize multimodal fusion. It refines the Straight‑Through Gumbel‑Softmax (STGS) method by reducing variance with Rao‑Blackwellization, stabilizing network training. Using a two‑level search approach, the framework optimizes the network architecture, parameters, and performance. Crucial features are efficiently identified from backbone networks, while within the cell structure, a weighted fusion operation integrates information from various sources. Varying parameters such as the temperature and the number of Monte Carlo samples yields an architecture that maximizes classification performance and generalizes better. Experimental results on the FakeAVCeleb and SWAN‑DF datasets demonstrate an AUC of 95.4%, achieved with minimal model parameters.

Abstract:
The rapid advancement of deepfake technology poses a significant threat to digital media integrity. Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality. This creates risks of misinformation and fraud, with severe implications for personal privacy and security. Our research addresses the critical issue of deepfakes through a multimodal approach, targeting both visual and auditory elements. This strategy recognizes that human perception integrates multiple sensory inputs, particularly visual and auditory information, to form a complete understanding of media content. For visual analysis, we developed a model that extracts nine distinct facial characteristics and then applies various machine learning and deep learning models. For auditory analysis, our model uses mel‑spectrogram analysis for feature extraction and then applies various machine learning and deep learning models. To enable a combined analysis, real and deepfake audio in the original dataset were swapped for testing purposes, and the samples were balanced. Using our proposed models for video and audio classification, i.e., an Artificial Neural Network and VGG19, the overall sample is classified as deepfake if either component is identified as such. Our multimodal framework combines visual and auditory analyses, yielding an accuracy of 94%.
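The decision rule stated above (a sample counts as a deepfake if either modality is flagged) amounts to a logical OR over the two per‑modality decisions. A minimal sketch follows; the function name and the string labels are assumptions for illustration, and the per‑modality classifiers themselves are not modeled.

```python
def classify_sample(video_is_fake: bool, audio_is_fake: bool) -> str:
    """Hypothetical late-fusion rule: flag the sample when at least one
    modality-specific classifier flags its own stream."""
    return "deepfake" if (video_is_fake or audio_is_fake) else "real"


# Enumerate all four combinations of per-modality decisions.
for video_flag in (False, True):
    for audio_flag in (False, True):
        print(video_flag, audio_flag, classify_sample(video_flag, audio_flag))
```

An OR rule favors catching fakes over avoiding false alarms: a false positive in either modality is enough to flag an otherwise real sample.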

Abstract:
In speech deepfake detection, one of the critical aspects is developing detectors that generalize to unseen data and distinguish fake signals across different datasets. Common approaches to this challenge involve incorporating diverse data into the training process or fine‑tuning models on unseen datasets. However, these solutions can be computationally demanding and may lead to the loss of knowledge acquired from previously learned data. Continual learning techniques offer a potential solution to this problem, allowing models to learn from unseen data without losing what they have already learned. Still, the optimal way to apply these algorithms to speech deepfake detectors remains unclear. In this paper we address this aspect and investigate whether, when retraining a speech deepfake detector, it is more effective to apply continual learning across the entire model or to update only some of its layers while freezing others. Our findings, validated across multiple models, indicate that the most effective approach among those analyzed is to update only the weights of the initial layers, which are responsible for processing the input features of the detector.

Abstract:
Speech deepfakes pose a significant threat to personal security and content authenticity. Several detectors have been proposed in the literature, and one of the primary challenges these systems have to face is the generalization over unseen data to identify fake signals across a wide range of datasets. In this paper, we introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture. The Mixture of Experts framework is well‑suited for the speech deepfake detection task due to its ability to specialize in different input types and handle data variability efficiently. This approach offers superior generalization and adaptability to unseen data compared to traditional single models or ensemble methods. Additionally, its modular structure supports scalable updates, making it more flexible in managing the evolving complexity of deepfake techniques while maintaining high detection accuracy. We propose an efficient, lightweight gating mechanism to dynamically assign expert weights for each input, optimizing detection performance. Experimental results across multiple datasets demonstrate the effectiveness and potential of our proposed approach.
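The gating idea described above can be sketched as a softmax over per‑input gate logits, with the final detection score being the weighted sum of the experts' scores. Everything named here (the function names, the toy scores, and the use of a plain softmax gate) is an assumption for illustration; the abstract does not describe the actual gating network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_score(expert_scores, gate_logits):
    """Combine per-expert fake-probabilities using input-dependent weights.

    In a real system gate_logits would come from a small gating network
    applied to the input; here they are passed in directly.
    """
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))


# Three hypothetical experts disagree; the gate trusts the second one most.
expert_scores = [0.20, 0.90, 0.40]
gate_logits = [0.0, 2.0, 0.0]
print(round(mixture_score(expert_scores, gate_logits), 3))
```

With equal gate logits the mixture reduces to a plain average of the experts, so the gate only changes the outcome when it has a reason to prefer one expert for a given input.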

Abstract:
Text‑to‑Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfakes, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech based on complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without accessing the speech content within. Our key idea is to turn a neural audio codec into a novel decoupling model that separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real‑world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques, with an equal error rate (EER) down to 2.02%. At the same time, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, our benchmark constructed for anti‑deepfake and anti‑content‑recovery evaluation provides a basis for future research on audio privacy preservation and deepfake detection.

Abstract:
Mainstream zero‑shot TTS production systems like Voicebox and Seed‑TTS achieve human‑parity speech by leveraging flow‑matching and diffusion models, respectively. Unfortunately, human‑level audio synthesis leads to identity misuse and information security issues. Many anti‑spoofing models have been developed against deepfake audio. However, the efficacy of current state‑of‑the‑art anti‑spoofing models in countering audio synthesized by diffusion and flow‑matching based TTS systems remains unknown. In this paper, we propose the Diffusion and Flow‑matching based Audio Deepfake (DFADD) dataset. The DFADD dataset collects deepfake audio generated by advanced diffusion and flow‑matching TTS models. Additionally, we reveal that current anti‑spoofing models lack sufficient robustness against highly human‑like audio generated by diffusion and flow‑matching TTS systems. The proposed DFADD dataset addresses this gap and provides a valuable resource for developing more resilient anti‑spoofing models.

Abstract:
This paper describes our systems submitted to the ASVspoof 5 Challenge Track 1: Speech Deepfake Detection ‑ Open Condition, which consists of a stand‑alone speech deepfake (bonafide vs spoof) detection task. Recently, large‑scale self‑supervised models have become standard in Automatic Speech Recognition (ASR) and other speech processing tasks. Thus, we leverage a pre‑trained WavLM as a front‑end model and pool its representations with different back‑end techniques. The complete framework is fine‑tuned using only the training dataset of the challenge, similar to the closed condition. In addition, we apply data augmentation by adding noise and reverberation using the MUSAN noise and RIR datasets. We also experiment with codec augmentations to increase the performance of our method. Finally, we use the Bosaris toolkit for score calibration and system fusion to obtain better Cllr scores. Our fused system achieves 0.0937 minDCF, 3.42% EER, 0.1927 Cllr, and 0.1375 actDCF.
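Since EER is the metric reported throughout these abstracts, a small worked example may help. The sketch below sweeps thresholds over the observed scores and returns the error rate at the point where the false acceptance and false rejection rates are closest. It assumes higher scores mean "more likely bonafide", uses toy scores, and is only an approximation: official evaluation tooling typically interpolates between operating points, so this is not the challenge's scoring code.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by a threshold sweep (hypothetical helper).

    A trial is accepted as bonafide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bonafide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one spoof scores high and one bonafide scores low.
bonafide = [0.9, 0.8, 0.2, 0.7]
spoof = [0.1, 0.3, 0.85, 0.4]
print(f"EER = {equal_error_rate(bonafide, spoof):.2%}")
```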

Abstract:
This paper describes the USTC‑KXDIGIT system submitted to the ASVspoof5 Challenge for Track 1 (speech deepfake detection) and Track 2 (spoofing‑robust automatic speaker verification, SASV). Track 1 showcases a diverse range of technical qualities from potential processing algorithms and includes both open and closed conditions. For these conditions, our system consists of a cascade of a front‑end feature extractor and a back‑end classifier. We focus on extensive embedding engineering and on enhancing the generalization of the back‑end classifier model. Specifically, the embedding engineering is based on hand‑crafted features and speech representations from a self‑supervised model, used for closed and open conditions, respectively. To detect spoof attacks under various adversarial conditions, we trained multiple systems on an augmented training set. Additionally, we used voice conversion technology to synthesize fake audio from genuine audio in the training set to enrich the synthesis algorithms. To leverage the complementary information learned by different model architectures, we employed activation ensemble and fused scores from different systems to obtain the final decision score for spoof detection. During the evaluation phase, the proposed methods achieved 0.3948 minDCF and 14.33% EER in the closed condition, and 0.0750 minDCF and 2.59% EER in the open condition, demonstrating the robustness of our submitted systems under adversarial conditions. In Track 2, we continued using the CM system from Track 1 and fused it with a CNN‑based ASV system. This approach achieved 0.2814 min‑aDCF in the closed condition and 0.0756 min‑aDCF in the open condition, showcasing superior performance in the SASV system.