arXiv Papers of Anti-Spoofing

Paperid: 1, https://arxiv.org/pdf/2512.07352.pdf   GitHub GitHub
Authors:Xueping Zhang, Zhenshan Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li
Title: MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection
Abstract:
Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms. Based on this dataset, we define the API tracing task, enabling fine-grained attribution of spoofed audio to its generation source. We further propose Nes2Net-LA, a local-attention enhanced variant of Nes2Net that improves local context modeling and fine-grained spoofing feature extraction. Experiments show that Nes2Net-LA achieves state-of-the-art performance and offers superior robustness, particularly under diverse and unseen spoofing conditions. Code \footnote{https://github.com/XuepingZhang/MultiAPI-Spoof} and dataset \footnote{https://xuepingzhang.github.io/MultiAPI-Spoof-Dataset/} have released.
Authors:Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, Kui Ren
Title: Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
Abstract:
With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.
Authors:Zhenshan Zhang, Xueping Zhang, Yechen Wang, Liwei Jin, Ming Li
Title: The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures
Abstract:
This paper presents the first study on the impact of audio watermarking on spoofing countermeasures. While anti-spoofing systems are essential for securing speech-based applications, the influence of widely used audio watermarking, originally designed for copyright protection, remains largely unexplored. We construct watermark-augmented training and evaluation datasets, named the Watermark-Spoofing dataset, by applying diverse handcrafted and neural watermarking methods to existing anti-spoofing datasets. Experiments show that watermarking consistently degrades anti-spoofing performance, with higher watermark density correlating with higher Equal Error Rates (EERs). To mitigate this, we propose the Knowledge-Preserving Watermark Learning (KPWL) framework, enabling models to adapt to watermark-induced shifts while preserving their original-domain spoofing detection capability. These findings reveal audio watermarking as a previously overlooked domain shift and establish the first benchmark for developing watermark-resilient anti-spoofing systems. All related protocols are publicly available at https://github.com/Alphawarheads/Watermark_Spoofing.git
Authors:Xueping Zhang, Liwei Jin, Yechen Wang, Linxi Li, Ming Li
Title: CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-spoofing Countermeasures
Abstract:
Component-level audio Spoofing (Comp-Spoof) targets a new form of audio manipulation where only specific components of a signal, such as speech or environmental sound, are forged or substituted while other components remain genuine. Existing anti-spoofing datasets and methods treat an utterance or a segment as entirely bona fide or entirely spoofed, and thus cannot accurately detect component-level spoofing. To address this, we construct a new dataset, CompSpoof, covering multiple combinations of bona fide and spoofed speech and environmental sound. We further propose a separation-enhanced joint learning framework that separates audio components apart and applies anti-spoofing models to each one. Joint learning is employed, preserving information relevant for detection. Extensive experiments demonstrate that our method outperforms the baseline, highlighting the necessity of separate components and the importance of detecting spoofing for each component separately. Datasets and code are available at: https://github.com/XuepingZhang/CompSpoof.
Authors:Jeongmin Yu, Susang Kim, Kisu Lee, Taekyoung Kwon, Won-Yong Shin, Ha Young Kim
Title: Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing
Abstract:
Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.
Authors:Andrei Balykin, Anvar Ganiev, Denis Kondranin, Kirill Polevoda, Nikolai Liudkevich, Artem Petrov
Title: Paired-Sampling Contrastive Framework for Joint Physical-Digital Face Attack Detection
Abstract:
Modern face recognition systems remain vulnerable to spoofing attempts, including both physical presentation attacks and digital forgeries. Traditionally, these two attack vectors have been handled by separate models, each targeting its own artifacts and modalities. However, maintaining distinct detectors increases system complexity and inference latency and leaves systems exposed to combined attack vectors. We propose the Paired-Sampling Contrastive Framework, a unified training approach that leverages automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Evaluated on the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, our method achieves an average classification error rate (ACER) of 2.10 percent, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical for real-world deployment. Code and pretrained models are available at https://github.com/xPONYx/iccv2025_deepfake_challenge.
Authors:Kun-Hsiang Lin, Yu-Wen Tseng, Kang-Yang Huang, Jhih-Ciang Wu, Wen-Huang Cheng
Title: InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing
Abstract:
Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at https://kunkunlin1221.github.io/InstructFLIP.
Authors:Ivan Viakhirev, Daniil Sirota, Aleksandr Smirnov, Kirill Borodin
Title: Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection
Abstract:
Advances in voice conversion and text-to-speech synthesis have made automatic speaker verification (ASV) systems more susceptible to spoofing attacks. This work explores modest refinements to the AASIST anti-spoofing architecture. It incorporates a frozen Wav2Vec 2.0 encoder to retain self-supervised speech representations in limited-data settings, substitutes the original graph attention block with a standardized multi-head attention module using heterogeneous query projections, and replaces heuristic frame-segment fusion with a trainable, context-aware integration layer. When evaluated on the ASVspoof 5 corpus, the proposed system reaches a 7.6\% equal error rate (EER), improving on a re-implemented AASIST baseline under the same training conditions. Ablation experiments suggest that each architectural change contributes to the overall performance, indicating that targeted adjustments to established models may help strengthen speech deepfake detection in practical scenarios. The code is publicly available at https://github.com/KORALLLL/AASIST_SCALING.
Authors:Seungjin Jung, Kanghee Lee, Yonghyun Jeong, Haeun Noh, Jungmin Lee, Jongwon Choi
Title: Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing
Abstract:
Domain Generalizable Face Anti-Spoofing (DGFAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DGFAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
Authors:Nidheesh Gorthi, Kartik Thakral, Rishabh Ranjan, Richa Singh, Mayank Vatsa
Title: LitMAS: A Lightweight and Generalized Multi-Modal Anti-Spoofing Framework for Biometric Security
Abstract:
Biometric authentication systems are increasingly being deployed in critical applications, but they remain susceptible to spoofing. Since most of the research efforts focus on modality-specific anti-spoofing techniques, building a unified, resource-efficient solution across multiple biometric modalities remains a challenge. To address this, we propose LitMAS, a $\textbf{Li}$gh$\textbf{t}$ weight and generalizable $\textbf{M}$ulti-modal $\textbf{A}$nti-$\textbf{S}$poofing framework designed to detect spoofing attacks in speech, face, iris, and fingerprint-based biometric systems. At the core of LitMAS is a Modality-Aligned Concentration Loss, which enhances inter-class separability while preserving cross-modal consistency and enabling robust spoof detection across diverse biometric traits. With just 6M parameters, LitMAS surpasses state-of-the-art methods by $1.36\%$ in average EER across seven datasets, demonstrating high efficiency, strong generalizability, and suitability for edge deployment. Code and trained models are available at https://github.com/IAB-IITJ/LitMAS.
Authors:Mika Feng, Koichi Ito, Takafumi Aoki, Tetsushi Ohki, Masakatsu Nishigaki
Title: Leveraging Intermediate Features of Vision Transformer for Face Anti-Spoofing
Abstract:
Face recognition systems are designed to be robust against changes in head pose, illumination, and blurring during image capture. If a malicious person presents a face photo of the registered user, they may bypass the authentication process illegally. Such spoofing attacks need to be detected before face recognition. In this paper, we propose a spoofing attack detection method based on Vision Transformer (ViT) to detect minute differences between live and spoofed face images. The proposed method utilizes the intermediate features of ViT, which have a good balance between local and global features that are important for spoofing attack detection, for calculating loss in training and score in inference. The proposed method also introduces two data augmentation methods: face anti-spoofing data augmentation and patch-wise data augmentation, to improve the accuracy of spoofing attack detection. We demonstrate the effectiveness of the proposed method through experiments using the OULU-NPU and SiW datasets. The project page is available at: https://gsisaoki.github.io/FAS-ViT-CVPRW/ .
Authors:Aurosweta Mahapatra, Ismail Rasim Ulgen, Abinay Reddy Naini, Carlos Busso, Berrak Sisman
Title: Can Emotion Fool Anti-spoofing?
Abstract:
Traditional anti-spoofing focuses on models and datasets built on synthetic speech with mostly neutral state, neglecting diverse emotional variations. As a result, their robustness against high-quality, emotionally expressive synthetic speech is uncertain. We address this by introducing EmoSpoof-TTS, a corpus of emotional text-to-speech samples. Our analysis shows existing anti-spoofing models struggle with emotional synthetic speech, exposing risks of emotion-targeted attacks. Even trained on emotional data, the models underperform due to limited focus on emotional aspect and show performance disparities across emotions. This highlights the need for emotion-focused anti-spoofing paradigm in both dataset and methodology. We propose GEM, a gated ensemble of emotion-specialized models with a speech emotion recognition gating network. GEM performs effectively across all emotions and neutral state, improving defenses against spoofing attacks. We release the EmoSpoof-TTS Dataset: https://emospoof-tts.github.io/Dataset/
Authors:Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li
Title: Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing
Abstract:
Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.
Authors:Shaocong Long, Qianyu Zhou, Xiangtai Li, Chenhao Ying, Yunhai Tong, Lizhuang Ma, Yuan Luo, Dacheng Tao
Title: Generative Classifier for Domain Generalization
Abstract:
Domain generalization (DG) aims to improve the generalizability of computer vision models toward distribution shifts. The mainstream DG methods focus on learning domain invariance, however, such methods overlook the potential inherent in domain-specific information. While the prevailing practice of discriminative linear classifier has been tailored to domain-invariant features, it struggles when confronted with diverse domain-specific information, e.g., intra-class shifts, that exhibits multi-modality. To address these issues, we explore the theoretical implications of relying on domain invariance, revealing the crucial role of domain-specific information in mitigating the target risk for DG. Drawing from these insights, we propose Generative Classifier-driven Domain Generalization (GCDG), introducing a generative paradigm for the DG classifier based on Gaussian Mixture Models (GMMs) for each class across domains. GCDG consists of three key modules: Heterogeneity Learning Classifier~(HLC), Spurious Correlation Blocking~(SCB), and Diverse Component Balancing~(DCB). Concretely, HLC attempts to model the feature distributions and thereby capture valuable domain-specific information via GMMs. SCB identifies the neural units containing spurious correlations and perturbs them, mitigating the risk of HLC learning spurious patterns. Meanwhile, DCB ensures a balanced contribution of components in HLC, preventing the underestimation or neglect of critical components. In this way, GCDG excels in capturing the nuances of domain-specific information characterized by diverse distributions. GCDG demonstrates the potential to reduce the target risk and encourage flat minima, improving the generalizability. Extensive experiments show GCDG's comparable performance on five DG benchmarks and one face anti-spoofing dataset, seamlessly integrating into existing DG methods with consistent improvements.
Authors:Jingyi Yang, Xun Lin, Zitong Yu, Liepiao Zhang, Xin Liu, Hui Li, Xiaochen Yuan, Xiaochun Cao
Title: DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing
Abstract:
With the availability of diverse sensor modalities (i.e., RGB, Depth, Infrared) and the success of multi-modal learning, multi-modal face anti-spoofing (FAS) has emerged as a prominent research focus. The intuition behind it is that leveraging multiple modalities can uncover more intrinsic spoofing traces. However, this approach presents more risk of misalignment. We identify two main types of misalignment: (1) \textbf{Intra-domain modality misalignment}, where the importance of each modality varies across different attacks. For instance, certain modalities (e.g., Depth) may be non-defensive against specific attacks (e.g., 3D mask), indicating that each modality has unique strengths and weaknesses in countering particular attacks. Consequently, simple fusion strategies may fall short. (2) \textbf{Inter-domain modality misalignment}, where the introduction of additional modalities exacerbates domain shifts, potentially overshadowing the benefits of complementary fusion. To tackle (1), we propose a alignment module between modalities based on mutual information, which adaptively enhances favorable modalities while suppressing unfavorable ones. To address (2), we employ a dual alignment optimization method that aligns both sub-domain hyperplanes and modality angle margins, thereby mitigating domain gaps. Our method, dubbed \textbf{D}ual \textbf{A}lignment of \textbf{D}omain and \textbf{M}odality (DADM), achieves state-of-the-art performance in extensive experiments across four challenging protocols demonstrating its robustness in multi-modal domain generalization scenarios. The codes will be released soon.
Authors:Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Fadi Boutros, Raghavendra Ramachandra, Naser Damer
Title: FoundPAD: Foundation Models Reloaded for Face Presentation Attack Detection
Abstract:
Although face recognition systems have seen a massive performance enhancement in recent years, they are still targeted by threats such as presentation attacks, leading to the need for generalizable presentation attack detection (PAD) algorithms. Current PAD solutions suffer from two main problems: low generalization to unknown cenarios and large training data requirements. Foundation models (FM) are pre-trained on extensive datasets, achieving remarkable results when generalizing to unseen domains and allowing for efficient task-specific adaption even when little training data are available. In this work, we recognize the potential of FMs to address common PAD problems and tackle the PAD task with an adapted FM for the first time. The FM under consideration is adapted with LoRA weights while simultaneously training a classification header. The resultant architecture, FoundPAD, is highly generalizable to unseen domains, achieving competitive results in several settings under different data availability scenarios and even when using synthetic training data. To encourage reproducibility and facilitate further research in PAD, we publicly release the implementation of FoundPAD at https://github.com/gurayozgur/FoundPAD .
Authors:Yingjie Ma, Zitong Yu, Xun Lin, Weicheng Xie, Linlin Shen
Title: BIG-MoE: Bypass Isolated Gating MoE for Generalized Multimodal Face Anti-Spoofing
Abstract:
In the domain of facial recognition security, multimodal Face Anti-Spoofing (FAS) is essential for countering presentation attacks. However, existing technologies encounter challenges due to modality biases and imbalances, as well as domain shifts. Our research introduces a Mixture of Experts (MoE) model to address these issues effectively. We identified three limitations in traditional MoE approaches to multimodal FAS: (1) Coarse-grained experts' inability to capture nuanced spoofing indicators; (2) Gated networks' susceptibility to input noise affecting decision-making; (3) MoE's sensitivity to prompt tokens leading to overfitting with conventional learning methods. To mitigate these, we propose the Bypass Isolated Gating MoE (BIG-MoE) framework, featuring: (1) Fine-grained experts for enhanced detection of subtle spoofing cues; (2) An isolation gating mechanism to counteract input noise; (3) A novel differential convolutional prompt bypass enriching the gating network with critical local features, thereby improving perceptual capabilities. Extensive experiments on four benchmark datasets demonstrate significant generalization performance improvement in multimodal FAS task. The code is released at https://github.com/murInJ/BIG-MoE.
Authors:Debasmita Pal, Redwan Sony, Arun Ross
Title: A Parametric Approach to Adversarial Augmentation for Cross-Domain Iris Presentation Attack Detection
Abstract:
Iris-based biometric systems are vulnerable to presentation attacks (PAs), where adversaries present physical artifacts (e.g., printed iris images, textured contact lenses) to defeat the system. This has led to the development of various presentation attack detection (PAD) algorithms, which typically perform well in intra-domain settings. However, they often struggle to generalize effectively in cross-domain scenarios, where training and testing employ different sensors, PA instruments, and datasets. In this work, we use adversarial training samples of both bonafide irides and PAs to improve the cross-domain performance of a PAD classifier. The novelty of our approach lies in leveraging transformation parameters from classical data augmentation schemes (e.g., translation, rotation) to generate adversarial samples. We achieve this through a convolutional autoencoder, ADV-GEN, that inputs original training samples along with a set of geometric and photometric transformations. The transformation parameters act as regularization variables, guiding ADV-GEN to generate adversarial samples in a constrained search space. Experiments conducted on the LivDet-Iris 2017 database, comprising four datasets, and the LivDet-Iris 2020 dataset, demonstrate the efficacy of our proposed method. The code is available at https://github.com/iPRoBe-lab/ADV-GEN-IrisPAD.
Authors:Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev
Title: Discovering Spoofing Attempts on Language Model Watermarks
Abstract:
LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat. We make all our code available at https://github.com/eth-sri/watermark-spoofing-detection .
Authors:Zonghui Guo, Yingjie Liu, Jie Zhang, Haiyong Zheng, Shiguang Shan
Title: Face Forgery Detection with Elaborate Backbone
Abstract:
Face Forgery Detection (FFD), or Deepfake detection, aims to determine whether a digital face is real or fake. Due to different face synthesis algorithms with diverse forgery patterns, FFD models often overfit specific patterns in training datasets, resulting in poor generalization to other unseen forgeries. This severe challenge requires FFD models to possess strong capabilities in representing complex facial features and extracting subtle forgery cues. Although previous FFD models directly employ existing backbones to represent and extract facial forgery cues, the critical role of backbones is often overlooked, particularly as their knowledge and capabilities are insufficient to address FFD challenges, inevitably limiting generalization. Therefore, it is essential to integrate the backbone pre-training configurations and seek practical solutions by revisiting the complete FFD workflow, from backbone pre-training and fine-tuning to inference of discriminant results. Specifically, we analyze the crucial contributions of backbones with different configurations in FFD task and propose leveraging the ViT network with self-supervised learning on real-face datasets to pre-train a backbone, equipping it with superior facial representation capabilities. We then build a competitive backbone fine-tuning framework that strengthens the backbone's ability to extract diverse forgery cues within a competitive learning mechanism. Moreover, we devise a threshold optimization mechanism that utilizes prediction confidence to improve the inference reliability. Comprehensive experiments demonstrate that our FFD model with the elaborate backbone achieves excellent performance in FFD and extra face-related tasks, i.e., presentation attack detection. Code and models are available at https://github.com/zhenglab/FFDBackbone.
Authors:Xinxu Ge, Xin Liu, Zitong Yu, Jingang Shi, Chun Qi, Jie Li, Heikki Kälviäinen
Title: DiffFAS: Face Anti-Spoofing via Generative Diffusion Models
Abstract:
Face anti-spoofing (FAS) plays a vital role in preventing face recognition (FR) systems from presentation attacks. Nowadays, FAS systems face the challenge of domain shift, impacting the generalization performance of existing FAS methods. In this paper, we rethink about the inherence of domain shift and deconstruct it into two factors: image style and image quality. Quality influences the purity of the presentation of spoof information, while style affects the manner in which spoof information is presented. Based on our analysis, we propose DiffFAS framework, which quantifies quality as prior information input into the network to counter image quality shift, and performs diffusion-based high-fidelity cross-domain and cross-attack types generation to counter image style shift. DiffFAS transforms easily collectible live faces into high-fidelity attack faces with precise labels while maintaining consistency between live and spoof face identities, which can also alleviate the scarcity of labeled data with novel type attacks faced by nowadays FAS system. We demonstrate the effectiveness of our framework on challenging cross-domain and cross-attack FAS datasets, achieving the state-of-the-art performance. Available at https://github.com/murphytju/DiffFAS.
Authors:Rizhao Cai, Cecelia Soh, Zitong Yu, Haoliang Li, Wenhan Yang, Alex Kot
Title: Towards Data-Centric Face Anti-Spoofing: Improving Cross-domain Generalization via Physics-based Data Synthesis
Abstract:
Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms for improving cross-domain performance, data-centric research for face anti-spoofing, improving generalization from data quality and quantity, is largely ignored. Therefore, our work starts with data-centric FAS by conducting a comprehensive investigation from the data perspective for improving cross-domain generalization of FAS models. More specifically, at first, based on physical procedures of capturing and recapturing, we propose task-specific FAS data augmentation (FAS-Aug), which increases data diversity by synthesizing data of artifacts, such as printing noise, color distortion, moiré pattern, \textit{etc}. Our experiments show that using our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross-domain performance. Nevertheless, we observe that models may rely on the augmented artifacts, which are not environment-invariant, and using FAS-Aug may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and improve the generalization performance. Last but not least, our proposed FAS-Aug and SARE with recent Vision Transformer backbones can achieve state-of-the-art performance on the FAS cross-domain generalization protocols. The implementation is available at https://github.com/RizhaoCai/FAS_Aug.
Authors:Anmol Guragain, Tianchi Liu, Zihan Pan, Hardik B. Sailor, Qiongqiong Wang
Title: Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024
Abstract:
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore the ensemble methods, utilizing speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The codes can be accessed at https://github.com/Anmol2059/SVDD2024.
Authors:Haoyuan Zhang, Xiangyu Zhu, Li Gao, Jiawei Pan, Kai Pang, Guoying Zhao, Zhen Lei
Title: Spoof Trace Discovery for Deep Learning Based Explainable Face Anti-Spoofing
Abstract:
With the rapid growth usage of face recognition in people's daily life, face anti-spoofing becomes increasingly important to avoid malicious attacks. Recent face anti-spoofing models can reach a high classification accuracy on multiple datasets but these models can only tell people "this face is fake" while lacking the explanation to answer "why it is fake". Such a system undermines trustworthiness and causes user confusion, as it denies their requests without providing any explanations. In this paper, we incorporate XAI into face anti-spoofing and propose a new problem termed X-FAS (eXplainable Face Anti-Spoofing) empowering face anti-spoofing models to provide an explanation. We propose SPTD (SPoof Trace Discovery), an X-FAS method which can discover spoof concepts and provide reliable explanations on the basis of discovered concepts. To evaluate the quality of X-FAS methods, we propose an X-FAS benchmark with annotated spoof traces by experts. We analyze SPTD explanations on face anti-spoofing dataset and compare SPTD quantitatively and qualitatively with previous XAI methods on proposed X-FAS benchmark. Experimental results demonstrate SPTD's ability to generate reliable explanations.
Authors:Guosheng Zhang, Keyao Wang, Haixiao Yue, Ajian Liu, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
Title: Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models
Abstract:
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
Authors:Haoze Li, Jie Zhang, Guoying Zhao, Stephen Lin, Shiguang Shan
Title: Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection
Abstract:
Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose \textbf{SVLP-IL}, a VLP-based RF-IL framework that balances stability and plasticity via \textit{Multi-Aspect Prompting} (MAP) and \textit{Selective Elastic Weight Consolidation} (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.
Authors:Xingming Long, Jie Zhang, Shiguang Shan
Title: Confidence Aware Learning for Reliable Face Anti-spoofing
Abstract:
Current Face Anti-spoofing (FAS) models tend to make overly confident predictions even when encountering unfamiliar scenarios or unknown presentation attacks, which leads to serious potential risks. To solve this problem, we propose a Confidence Aware Face Anti-spoofing (CA-FAS) model, which is aware of its capability boundary, thus achieving reliable liveness detection within this boundary. To enable the CA-FAS to "know what it doesn't know", we propose to estimate its confidence during the prediction of each sample. Specifically, we build Gaussian distributions for both the live faces and the known attacks. The prediction confidence for each sample is subsequently assessed using the Mahalanobis distance between the sample and the Gaussians for the "known data". We further introduce the Mahalanobis distance-based triplet mining to optimize the parameters of both the model and the constructed Gaussians as a whole. Extensive experiments show that the proposed CA-FAS can effectively recognize samples with low prediction confidence and thus achieve much more reliable performance than other FAS models by filtering out samples that are beyond its reliable range.
Authors:Tahar Chettaoui, Naser Damer, Fadi Boutros
Title: Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications
Abstract:
Foundation models such as CLIP have demonstrated exceptional zero- and few-shot transfer capabilities across diverse vision tasks. However, when fine-tuned for highly specialized biometric tasks, face recognition (FR), morphing attack detection (MAD), and presentation attack detection (PAD), these models may suffer from over-specialization. Thus, they may lose one of their foundational strengths, cross-domain generalization. In this work, we systematically quantify these trade-offs by evaluating three instances of CLIP fine-tuned for FR, MAD, and PAD. We evaluate each adapted model as well as the original CLIP baseline on 14 general vision datasets under zero-shot and linear-probe protocols, alongside common FR, MAD, and PAD benchmarks. Our results indicate that fine-tuned models suffer from over-specialization, especially when fine-tuned for complex tasks of FR. Also, our results pointed out that task complexity and classification head design, multi-class (FR) vs. binary (MAD and PAD), correlate with the degree of catastrophic forgetting. The FRoundation model with the ViT-L backbone outperforms other approaches on the large-scale FR benchmark IJB-C, achieving an improvement of up to 58.52%. However, it experiences a substantial performance drop on ImageNetV2, reaching only 51.63% compared to 69.84% achieved by the baseline CLIP model. Moreover, the larger CLIP architecture consistently preserves more of the model's original generalization ability than the smaller variant, indicating that increased model capacity may help mitigate over-specialization.
Authors:Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi
Title: Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches
Abstract:
In real-world applications, it is challenging to build a speaker verification system that is simultaneously robust against common threats, including spoofing attacks, channel mismatch, and domain mismatch. Traditional automatic speaker verification (ASV) systems often tackle these issues separately, leading to suboptimal performance when faced with simultaneous challenges. In this paper, we propose an integrated framework that incorporates pair-wise learning and spoofing attack simulation into the meta-learning paradigm to enhance robustness against these multifaceted threats. This novel approach employs an asymmetric dual-path model and a multi-task learning strategy to handle ASV, anti-spoofing, and spoofing-aware ASV tasks concurrently. A new testing dataset, CNComplex, is introduced to evaluate system performance under these combined threats. Experimental results demonstrate that our integrated model significantly improves performance over traditional ASV systems across various scenarios, showcasing its potential for real-world deployment. Additionally, the proposed framework's ability to generalize across different conditions highlights its robustness and reliability, making it a promising solution for practical ASV applications.
Authors:Jiabao Guo, Ajian Liu, Yunfeng Diao, Jin Zhang, Hui Ma, Bo Zhao, Richang Hong, Meng Wang
Title: Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering
Abstract:
The challenge of Domain Generalization (DG) in Face Anti-Spoofing (FAS) is the significant interference of domain-specific signals on subtle spoofing clues. Recently, some CLIP-based algorithms have been developed to alleviate this interference by adjusting the weights of visual classifiers. However, our analysis of this class-wise prompt engineering suffers from two shortcomings for DG FAS: (1) The categories of facial categories, such as real or spoof, have no semantics for the CLIP model, making it difficult to learn accurate category descriptions. (2) A single form of prompt cannot portray the various types of spoofing. In this work, instead of class-wise prompts, we propose a novel Content-aware Composite Prompt Engineering (CCPE) that generates instance-wise composite prompts, including both fixed template and learnable prompts. Specifically, our CCPE constructs content-aware prompts from two branches: (1) Inherent content prompt explicitly benefits from abundant transferred knowledge from the instruction-based Large Language Model (LLM). (2) Learnable content prompts implicitly extract the most informative visual content via Q-Former. Moreover, we design a Cross-Modal Guidance Module (CGM) that dynamically adjusts unimodal features for fusion to achieve better generalized FAS. Finally, our CCPE has been validated for its effectiveness in multiple cross-domain experiments and achieves state-of-the-art (SOTA) results.
Authors:Yukang Lin, Yan Hong, Zunnan Xu, Xindi Li, Chao Xu, Chuanbiao Song, Ronghui Li, Haoxing Chen, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang, Xiu Li
Title: InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation
Abstract:
Recent video generation research has focused heavily on isolated actions, leaving interactive motions-such as hand-face interactions-largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.
Authors:Wangjie Li, Xingjia Xie, Yishuang Li, Wenhao Guan, Kaidi Wang, Pengyu Ren, Lin Li, Qingyang Hong
Title: XMUspeech Systems for the ASVspoof 5 Challenge
Abstract:
In this paper, we present our submitted XMUspeech systems to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in ASVspoof 5 database has significantly increased. And we observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the performance of AASIST, HM-Conformer, Hubert, and Wav2vec2 with various input features and loss functions. Specifically, in order to obtain artifact-related information, we trained self-supervised models on the dataset containing spoofing utterances as the feature extractors. And we applied an adaptive multi-scale feature fusion (AMFF) method to integrate features from multiple Transformer layers with the hand-crafted feature to enhance the detection capability. In addition, we conducted extensive experiments on one-class loss functions and provided optimized configurations to better align with the anti-spoofing task. Our fusion system achieved a minDCF of 0.4783 and an EER of 20.45% in the closed condition, and a minDCF of 0.2245 and an EER of 9.36% in the open condition.
Authors:Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Title: Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy
Abstract:
Recent advances in neural audio codec-based speech generation (CoSG) models have produced remarkably realistic audio deepfakes. We refer to deepfake speech generated by CoSG systems as codec-based deepfake, or CodecFake. Although existing anti-spoofing research on CodecFake predominantly focuses on verifying the authenticity of audio samples, almost no attention was given to tracing the CoSG used in generating these deepfakes. In CodecFake generation, processes such as speech-to-unit encoding, discrete unit modeling, and unit-to-speech decoding are fundamentally based on neural audio codecs. Motivated by this, we introduce source tracing for CodecFake via neural audio codec taxonomy, which dissects neural audio codecs to trace CoSG. Our experimental results on the CodecFake+ dataset provide promising initial evidence for the feasibility of CodecFake source tracing while also highlighting several challenges that warrant further investigation.
Authors:Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee
Title: CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset
Abstract:
With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.
Authors:Jiawei Du, I-Ming Lin, I-Hsiang Chiu, Xuanjun Chen, Haibin Wu, Wenze Ren, Yu Tsao, Hung-yi Lee, Jyh-Shing Roger Jang
Title: DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset
Abstract:
Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human parity speech by leveraging Flow-matching and Diffusion models, respectively. Unfortunately, human-level audio synthesis leads to identity misuse and information security issues. Currently, many antispoofing models have been developed against deepfake audio. However, the efficacy of current state-of-the-art anti-spoofing models in countering audio synthesized by diffusion and flowmatching based TTS systems remains unknown. In this paper, we proposed the Diffusion and Flow-matching based Audio Deepfake (DFADD) dataset. The DFADD dataset collected the deepfake audio based on advanced diffusion and flowmatching TTS models. Additionally, we reveal that current anti-spoofing models lack sufficient robustness against highly human-like audio generated by diffusion and flow-matching TTS systems. The proposed DFADD dataset addresses this gap and provides a valuable resource for developing more resilient anti-spoofing models.
Authors:Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Chao-Han Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee, Szu-Wei Fu
Title: Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits
Abstract:
Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A\textsuperscript{3}T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detailed the process of re-implementing Voicebox training and dataset creation. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods. Despite human difficulty, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.
Authors:Abyad Enan, Mashrur Chowdhury, Sagar Dasgupta, Mizanur Rahman
Title: Quantum-Classical Hybrid Framework for Zero-Day Time-Push GNSS Spoofing Detection
Abstract:
Global Navigation Satellite Systems (GNSS) are critical for Positioning, Navigation, and Timing (PNT) applications. However, GNSS are highly vulnerable to spoofing attacks, where adversaries transmit counterfeit signals to mislead receivers. Such attacks can lead to severe consequences, including misdirected navigation, compromised data integrity, and operational disruptions. Most existing spoofing detection methods depend on supervised learning techniques and struggle to detect novel, evolved, and unseen attacks. To overcome this limitation, we develop a zero-day spoofing detection method using a Hybrid Quantum-Classical Autoencoder (HQC-AE), trained solely on authentic GNSS signals without exposure to spoofed data. By leveraging features extracted during the tracking stage, our method enables proactive detection before PNT solutions are computed. We focus on spoofing detection in static GNSS receivers, which are particularly susceptible to time-push spoofing attacks, where attackers manipulate timing information to induce incorrect time computations at the receiver. We evaluate our model against different unseen time-push spoofing attack scenarios: simplistic, intermediate, and sophisticated. Our analysis demonstrates that the HQC-AE consistently outperforms its classical counterpart, traditional supervised learning-based models, and existing unsupervised learning-based methods in detecting zero-day, unseen GNSS time-push spoofing attacks, achieving an average detection accuracy of 97.71% with an average false negative rate of 0.62% (when an attack occurs but is not detected). For sophisticated spoofing attacks, the HQC-AE attains an accuracy of 98.23% with a false negative rate of 1.85%. These findings highlight the effectiveness of our method in proactively detecting zero-day GNSS time-push spoofing attacks across various stationary GNSS receiver platforms.
Authors:Mengze Hong, Di Jiang, Zeying Xie, Weiwei Zhao, Guan Wang, Chen Jason Zhang
Title: Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis
Abstract:
As audio deepfakes transition from research artifacts to widely available commercial tools, robust biometric authentication faces pressing security threats in high-stakes industries. This paper presents a systematic empirical evaluation of state-of-the-art speaker authentication systems based on a large-scale speech synthesis dataset, revealing two major security vulnerabilities: 1) modern voice cloning models trained on very small samples can easily bypass commercial speaker verification systems; and 2) anti-spoofing detectors struggle to generalize across different methods of audio synthesis, leading to a significant gap between in-domain performance and real-world robustness. These findings call for a reconsideration of security measures and stress the need for architectural innovations, adaptive defenses, and the transition towards multi-factor authentication.
Authors:Gaojian Wang, Feng Lin, Tong Wu, Zhenguang Liu, Zhongjie Ba, Kui Ren
Title: FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning
Abstract:
This work asks: with abundant, unlabeled real faces, how to learn a robust and transferable facial representation that boosts various face security tasks with respect to generalization performance? We make the first attempt and propose a self-supervised pretraining framework to learn fundamental representations of real face images, FSFM, that leverages the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region consistency and challenging inter-region coherency. Furthermore, we devise the ID network that naturally couples with MIM to establish underlying local-to-global correspondence via tailored self-distillation. These three learning objectives, namely 3C, empower encoding both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining, visual and facial self-supervised learning arts, and even outperforms task-specialized SOTA methods.
Authors:Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li
Title: ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan
Abstract:
Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generation models, either component can now be modified independently. Such component-level manipulations are harder to detect, as the remaining unaltered component can mislead the systems designed for whole deepfake audio, and they often sound more natural to human listeners. To address this gap, we have proposed CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing, which contains over 250k audio samples, with a total duration of approximately 283 hours. Based on the CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).
Authors:Yang Xiao, Ting Dang, Rohan Kumar Das
Title: RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing
Abstract:
Automatic speaker verification (ASV) systems are often affected by spoofing attacks. Recent transformer-based models have improved anti-spoofing performance by learning strong feature representations. However, these models usually need high computing power. To address this, we introduce RawTFNet, a lightweight CNN model designed for audio signals. The RawTFNet separates feature processing along time and frequency dimensions, which helps to capture the fine-grained details of synthetic speech. We tested RawTFNet on the ASVspoof 2021 LA and DF evaluation datasets. The results show that RawTFNet reaches comparable performance to that of the state-of-the-art models, while also using fewer computing resources. The code and models will be made publicly available.
Authors:Bikash Dutta, Rishabh Ranjan, Shyam Sathvik, Mayank Vatsa, Richa Singh
Title: Can Quantized Audio Language Models Perform Zero-Shot Spoofing Detection?
Abstract:
Quantization is essential for deploying large audio language models (LALMs) efficiently in resource-constrained environments. However, its impact on complex tasks, such as zero-shot audio spoofing detection, remains underexplored. This study evaluates the zero-shot capabilities of five LALMs, GAMA, LTU-AS, MERaLiON, Qwen-Audio, and SALMONN, across three distinct datasets: ASVspoof2019, In-the-Wild, and WaveFake, and investigates their robustness to quantization (FP32, FP16, INT8). Despite high initial spoof detection accuracy, our analysis demonstrates severe predictive biases toward spoof classification across all models, rendering their practical performance equivalent to random classification. Interestingly, quantization to FP16 precision resulted in negligible performance degradation compared to FP32, effectively halving memory and computational requirements without materially impacting accuracy. However, INT8 quantization intensified model biases, significantly degrading balanced accuracy. These findings highlight critical architectural limitations and emphasize FP16 quantization as an optimal trade-off, providing guidelines for practical deployment and future model refinement.
Authors:Hieu-Thi Luong, Inbal Rimon, Haim Permuter, Kong Aik Lee, Eng Siong Chng
Title: Robust Localization of Partially Fake Speech: Metrics and Out-of-Domain Evaluation
Abstract:
Partial audio deepfake localization poses unique challenges and remain underexplored compared to full-utterance spoofing detection. While recent methods report strong in-domain performance, their real-world utility remains unclear. In this analysis, we critically examine the limitations of current evaluation practices, particularly the widespread use of Equal Error Rate (EER), which often obscures generalization and deployment readiness. We propose reframing the localization task as a sequential anomaly detection problem and advocate for the use of threshold-dependent metrics such as accuracy, precision, recall, and F1-score, which better reflect real-world behavior. Specifically, we analyze the performance of the open-source Coarse-to-Fine Proposal Refinement Framework (CFPRF), which achieves a 20-ms EER of 7.61% on the in-domain PartialSpoof evaluation set, but 43.25% and 27.59% on the LlamaPartialSpoof and Half-Truth out-of-domain test sets. Interestingly, our reproduced version of the same model performs worse on in-domain data (9.84%) but better on the out-of-domain sets (41.72% and 14.98%, respectively). This highlights the risks of over-optimizing for in-domain EER, which can lead to models that perform poorly in real-world scenarios. It also suggests that while deep learning models can be effective on in-domain data, they generalize poorly to out-of-domain scenarios, failing to detect novel synthetic samples and misclassifying unfamiliar bona fide audio. Finally, we observe that adding more bona fide or fully synthetic utterances to the training data often degrades performance, whereas adding partially fake utterances improves it.
Authors:Yingjie Ma, Xun Lin, Zitong Yu, Xin Liu, Xiaochen Yuan, Weicheng Xie, Linlin Shen
Title: Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing
Abstract:
Face Anti-Spoofing (FAS) is essential for the security of facial recognition systems in diverse scenarios such as payment processing and surveillance. Current multimodal FAS methods often struggle with effective generalization, mainly due to modality-specific biases and domain shifts. To address these challenges, we introduce the \textbf{M}ulti\textbf{m}odal \textbf{D}enoising and \textbf{A}lignment (\textbf{MMDA}) framework. By leveraging the zero-shot generalization capability of CLIP, the MMDA framework effectively suppresses noise in multimodal data through denoising and alignment mechanisms, thereby significantly enhancing the generalization performance of cross-modal alignment. The \textbf{M}odality-\textbf{D}omain Joint \textbf{D}ifferential \textbf{A}ttention (\textbf{MD2A}) module in MMDA concurrently mitigates the impacts of domain and modality noise by refining the attention mechanism based on extracted common noise features. Furthermore, the \textbf{R}epresentation \textbf{S}pace \textbf{S}oft (\textbf{RS2}) Alignment strategy utilizes the pre-trained CLIP model to align multi-domain multimodal data into a generalized representation space in a flexible manner, preserving intricate representations and enhancing the model's adaptability to various unseen conditions. We also design a \textbf{U}-shaped \textbf{D}ual \textbf{S}pace \textbf{A}daptation (\textbf{U-DSA}) module to enhance the adaptability of representations while maintaining generalization performance. These improvements not only enhance the framework's generalization capabilities but also boost its ability to represent complex representations. Our experimental results on four benchmark datasets under different evaluation protocols demonstrate that the MMDA framework outperforms existing state-of-the-art methods in terms of cross-domain generalization and multimodal detection accuracy. The code will be released soon.
Authors:Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi kinnunen
Title: Explaining Speaker and Spoof Embeddings via Probing
Abstract:
This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker-related attributes as target labels. These attributes are categorized into two groups: metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.
Authors:Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng
Title: Room Impulse Responses help attackers to evade Deep Fake Detection
Abstract:
The ASVspoof 2021 benchmark, a widely-used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness of utilizing room impulse responses (RIRs) to enhance fake speech and increase their likelihood of evading fake speech detection systems. Our findings reveal that this simple approach significantly improves the evasion rate, doubling the SOTA system's EER. To counter this type of attack, We augmented training data with a large-scale synthetic/simulated RIR dataset. The results demonstrate significant improvement on both reverberated fake speech and original samples, reducing DF task EER to 2.13%.
Authors:Tianchi Liu, Ivan Kukanov, Zihan Pan, Qiongqiong Wang, Hardik B. Sailor, Kong Aik Lee
Title: Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing
Abstract:
The effects of language mismatch impact speech anti-spoofing systems, while investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach - Accent-based data expansion via TTS (ACCENT), which introduces diverse linguistic knowledge to monolingual-trained models, improving their cross-lingual capabilities. We conduct experiments on a large-scale dataset consisting of over 3 million samples, including 1.8 million training samples and nearly 1.2 million testing samples across 12 languages. The language mismatch effects are preliminarily quantified and remarkably reduced over 15% by applying the proposed ACCENT. This easily implementable method shows promise for multilingual and low-resource language scenarios.
Authors:Aurosweta Mahapatra, Ismail Rasim Ulgen, Berrak Sisman
Title: HuLA: Prosody-Aware Anti-Spoofing with Multi-Task Learning for Expressive and Emotional Synthetic Speech
Abstract:
Current anti-spoofing systems remain vulnerable to expressive and emotional synthetic speech, since they rarely leverage prosody as a discriminative cue. Prosody is central to human expressiveness and emotion, and humans instinctively use prosodic cues such as F0 patterns and voiced/unvoiced structure to distinguish natural from synthetic speech. In this paper, we propose HuLA, a two-stage prosody-aware multi-task learning framework for spoof detection. In Stage 1, a self-supervised learning (SSL) backbone is trained on real speech with auxiliary tasks of F0 prediction and voiced/unvoiced classification, enhancing its ability to capture natural prosodic variation similar to human perceptual learning. In Stage 2, the model is jointly optimized for spoof detection and prosody tasks on both real and synthetic data, leveraging prosodic awareness to detect mismatches between natural and expressive synthetic speech. Experiments show that HuLA consistently outperforms strong baselines on challenging out-of-domain dataset, including expressive, emotional, and cross-lingual attacks. These results demonstrate that explicit prosodic supervision, combined with SSL embeddings, substantially improves robustness against advanced synthetic speech attacks.
Authors:Yingjie Ma, Xun Lin, Yong Xu, Weicheng Xie, Zitong Yu
Title: PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning
Abstract:
Face anti-spoofing (FAS) has recently advanced in multimodal fusion, cross-domain generalization, and interpretability. With large language models and reinforcement learning (RL), strategy-based training offers new opportunities to jointly model these aspects. However, multimodal reasoning is more complex than unimodal reasoning, requiring accurate feature representation and cross-modal verification while facing scarce, high-quality annotations, which makes direct application of RL sub-optimal. We identify two key limitations of supervised fine-tuning plus RL (SFT+RL) for multimodal FAS: (1) limited multimodal reasoning paths restrict the use of complementary modalities and shrink the exploration space after SFT, weakening the effect of RL; and (2) mismatched single-task supervision versus diverse reasoning paths causes reasoning confusion, where models may exploit shortcuts by mapping images directly to answers and ignoring the intended reasoning. To address this, we propose PA-FAS, which enhances reasoning paths by constructing high-quality extended reasoning sequences from limited annotations, enriching paths and relaxing exploration constraints. We further introduce an answer-shuffling mechanism during SFT to force comprehensive multimodal analysis instead of using superficial cues, thereby encouraging deeper reasoning and mitigating shortcut learning. PA-FAS significantly improves multimodal reasoning accuracy and cross-domain generalization, and better unifies multimodal fusion, generalization, and interpretability for trustworthy FAS.
Authors:Xun Lin, Shuai Wang, Yi Yu, Zitong Yu, Jiale Zhou, Yizhong Liu, Xiaochun Cao, Alex Kot, Yefeng Zheng
Title: Learning Representation and Synergy Invariances: A Povable Framework for Generalized Multimodal Face Anti-Spoofing
Abstract:
Multimodal Face Anti-Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross-domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain-specific inter-modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self-supervised task enhancing intrinsic, generalizable modal features via cross-sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state-of-the-art cross-domain performance.
Authors:Hongyang Wang, Yichen Shi, Zhuofu Tao, Yuhao Gao, Liepiao Zhang, Xun Lin, Jun Feng, Xiaochen Yuan, Zitong Yu, Xiaochun Cao
Title: FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models
Abstract:
Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for FAS task. To address this gap, we propose FaceShield, a MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use an prompt-guided vision token masking (PVTM) strategy to random mask vision tokens, thereby improving the model's generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization. Our instruction datasets, protocols, and codes will be released soon.
Authors:Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki
Title: Optimizing DINOv2 with Registers for Face Anti-Spoofing
Abstract:
Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by ``The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025'' and SiW dataset.
Authors:Sheng Xiang, Yidong Jiang, Yunting Chen, Dawei Cheng, Guoping Zhao, Changjun Jiang
Title: Generative Dynamic Graph Representation Learning for Conspiracy Spoofing Detection
Abstract:
Spoofing detection in financial trading is crucial, especially for identifying complex behaviors such as conspiracy spoofing. Traditional machine-learning approaches primarily focus on isolated node features, often overlooking the broader context of interconnected nodes. Graph-based techniques, particularly Graph Neural Networks (GNNs), have advanced the field by leveraging relational information effectively. However, in real-world spoofing detection datasets, trading behaviors exhibit dynamic, irregular patterns. Existing spoofing detection methods, though effective in some scenarios, struggle to capture the complexity of dynamic and diverse, evolving inter-node relationships. To address these challenges, we propose a novel framework called the Generative Dynamic Graph Model (GDGM), which models dynamic trading behaviors and the relationships among nodes to learn representations for conspiracy spoofing detection. Specifically, our approach incorporates the generative dynamic latent space to capture the temporal patterns and evolving market conditions. Raw trading data is first converted into time-stamped sequences. Then we model trading behaviors using the neural ordinary differential equations and gated recurrent units, to generate the representation incorporating temporal dynamics of spoofing patterns. Furthermore, pseudo-label generation and heterogeneous aggregation techniques are employed to gather relevant information and enhance the detection performance for conspiratorial spoofing behaviors. Experiments conducted on spoofing detection datasets demonstrate that our approach outperforms state-of-the-art models in detection accuracy. Additionally, our spoofing detection system has been successfully deployed in one of the largest global trading markets, further validating the practical applicability and performance of the proposed method.
Authors:Shota Iwamatsu, Koichi Ito, Takafumi Aoki
Title: Backdoor Poisoning Attack Against Face Spoofing Attack Detection Methods
Abstract:
Face recognition systems are robust against environmental changes and noise, and thus may be vulnerable to illegal authentication attempts using user face photos, such as spoofing attacks. To prevent such spoofing attacks, it is crucial to discriminate whether the input image is a live user image or a spoofed image prior to the face recognition process. Most existing spoofing attack detection methods utilize deep learning, which necessitates a substantial amount of training data. Consequently, if malicious data is injected into a portion of the training dataset, a specific spoofing attack may be erroneously classified as live, leading to false positives. In this paper, we propose a novel backdoor poisoning attack method to demonstrate the latent threat of backdoor poisoning within face anti-spoofing detection. The proposed method enables certain spoofing attacks to bypass detection by embedding features extracted from the spoofing attack's face image into a live face image without inducing any perceptible visual alterations. Through experiments conducted on public datasets, we demonstrate that the proposed method constitutes a realistic threat to existing spoofing attack detection systems.
Authors:Fangling Jiang, Qi Li, Weining Wang, Gang Wang, Bing Liu, Zhenan Sun
Title: Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning
Abstract:
Recently the emergence of novel presentation attacks has drawn increasing attention to face anti-spoofing. However, existing methods tend to memorize data patterns from the training set, resulting in poor generalization to unknown attack types across different scenarios and limited interpretability. To address these challenges, this paper presents a reinforcement fine-tuning-based face anti-spoofing method that stimulates the capabilities of multimodal large language models to think and learn how to solve the anti-spoofing task itself, rather than relying on the memorization of authenticity patterns. We design verifiable class consistent reward and reasoning consistent reward, and employ a GRPO-based optimization strategy to guide the model in exploring reasoning policies from multiple perspectives to maximize expected rewards. As a result, through iterative trial-and-error learning while retaining only high-reward trajectories, the model distills highly generalizable decision-making rules from the extensive solution space to effectively address cross-domain face anti-spoofing tasks. Extensive experimental results demonstrate that our method achieves state-of-the-art cross-domain generalization performance. It generalizes well to diverse unknown attack types in unseen target domains while providing interpretable reasoning for its authenticity decisions without requiring labor-intensive textual annotations for training.
Authors:Fangling Jiang, Qi Li, Weining Wang, Wei Shen, Bing Liu, Zhenan Sun
Title: Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images
Abstract:
Face anti-spoofing is a critical technology for ensuring the security of face recognition systems. However, its ability to generalize across diverse scenarios remains a significant challenge. In this paper, we attribute the limited generalization ability to two key factors: covariate shift, which arises from external data collection variations, and semantic shift, which results from substantial differences in emerging attack types. To address both challenges, we propose a novel approach for learning unknown spoof prompts, relying solely on real face images from a single source domain. Our method generates textual prompts for real faces and potential unknown spoof attacks by leveraging the general knowledge embedded in vision-language models, thereby enhancing the model's ability to generalize to unseen target domains. Specifically, we introduce a diverse spoof prompt optimization framework to learn effective prompts. This framework constrains unknown spoof prompts within a relaxed prior knowledge space while maximizing their distance from real face images. Moreover, it enforces semantic independence among different spoof prompts to capture a broad range of spoof patterns. Experimental results on nine datasets demonstrate that the learned prompts effectively transfer the knowledge of vision-language models, enabling state-of-the-art generalization ability against diverse unknown attack types across unseen target domains without using any spoof face images.
Authors:Fangling Jiang, Qi Li, Bing Liu, Weining Wang, Caifeng Shan, Zhenan Sun, Ming-Hsuan Yang
Title: Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection
Abstract:
3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection-related text descriptions offer concise, universal information and are cost-effective to obtain. However, the potential of vision-language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine-grained, task-specific explicit prompts that effectively harness the knowledge embedded in pre-trained vision-language models. Furthermore, considering different input images may emphasize distinct knowledge graph elements, we introduce a visual-specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we leverage causal graph theory insights into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category-irrelevant local image patches using guidance from knowledge-based text features, fostering the learning of generalized causal prompts that align with category-relevant local patches. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.
Authors:Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller
Title: Generalizable Audio Spoofing Detection using Non-Semantic Representations
Abstract:
Rapid advancements in generative modeling have made synthetic audio generation easy, making speech-based services vulnerable to spoofing attacks. Consequently, there is a dire need for robust countermeasures more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability and fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection leveraging non-semantic universal audio representations. Extensive experiments have been performed to find suitable non-semantic features using TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.
Authors:Jefferson David Rodriguez Chivata, Davide Ghiani, Simone Maurizio La Cava, Marco Micheletto, Giulia Orrù, Federico Lama, Gian Luca Marcialis
Title: Deep Data Hiding for ICAO-Compliant Face Images: A Survey
Abstract:
ICAO-compliant facial images, initially designed for secure biometric passports, are increasingly becoming central to identity verification in a wide range of application contexts, including border control, digital travel credentials, and financial services. While their standardization enables global interoperability, it also facilitates practices such as morphing and deepfakes, which can be exploited for harmful purposes like identity theft and illegal sharing of identity documents. Traditional countermeasures like Presentation Attack Detection (PAD) are limited to real-time capture and offer no post-capture protection. This survey paper investigates digital watermarking and steganography as complementary solutions that embed tamper-evident signals directly into the image, enabling persistent verification without compromising ICAO compliance. We provide the first comprehensive analysis of state-of-the-art techniques to evaluate the potential and drawbacks of the underlying approaches concerning the applications involving ICAO-compliant images and their suitability under standard constraints. We highlight key trade-offs, offering guidance for secure deployment in real-world identity systems.
Authors:Juan E. Tapia, Mario Nieto, Juan M. Espin, Alvaro S. Rocamora, Javier Barrachina, Naser Damer, Christoph Busch, Marija Ivanovska, Leon Todorov, Renat Khizbullin, Lazar Lazarevich, Aleksei Grishin, Daniel Schulz, Sebastian Gonzalez, Amir Mohammadi, Ketan Kotwal, Sebastien Marcel, Raghavendra Mudgalgundurao, Kiran Raja, Patrick Schuch, Sushrut Patwardhan, Raghavendra Ramachandra, Pedro Couto Pereira, Joao Ribeiro Pinto, Mariana Xavier, Andrés Valenzuela, Rodrigo Lara, Borut Batagelj, Marko Peterlin, Peter Peer, Ajnas Muhammed, Diogo Nunes, Nuno Gonçalves
Title: Second Competition on Presentation Attack Detection on ID Card
Abstract:
This work summarises and reports the results of the second Presentation Attack Detection competition on ID cards. This new version includes new elements compared to the previous one. (1) An automatic evaluation platform was enabled for automatic benchmarking; (2) Two tracks were proposed in order to evaluate algorithms and datasets, respectively; and (3) A new ID card dataset was shared with Track 1 teams to serve as the baseline dataset for the training and optimisation. The Hochschule Darmstadt, Fraunhofer-IGD, and Facephi company jointly organised this challenge. 20 teams were registered, and 74 submitted models were evaluated. For Track 1, the "Dragons" team reached first place with an Average Ranking and Equal Error rate (EER) of AV-Rank of 40.48% and 11.44% EER, respectively. For the more challenging approach in Track 2, the "Incode" team reached the best results with an AV-Rank of 14.76% and 6.36% EER, improving on the results of the first edition of 74.30% and 21.87% EER, respectively. These results suggest that PAD on ID cards is improving, but it is still a challenging problem related to the number of images, especially of bona fide images.
Authors:Marco Micheletto, Giulia Orrù, Luca Ghiani, Gian Luca Marcialis
Title: Improving fingerprint presentation attack detection by an approach integrated into the personal verification stage
Abstract:
Presentation Attack Detection (PAD) systems are usually designed independently of the fingerprint verification system. While this can be acceptable for use cases where specific user templates are not predetermined, it represents a missed opportunity to enhance security in scenarios where integrating PAD with the fingerprint verification system could significantly leverage users' templates, which are the real target of a potential presentation attack. This does not mean that a PAD should be specifically designed for such users; that would imply the availability of many enrolled users' PAI and, consequently, complexity, time, and cost increase. On the contrary, we propose to equip a basic PAD, designed according to the state of the art, with an innovative add-on module called the Closeness Binary Code (CC) module. The term "closeness" refers to a peculiar property of the bona fide-related features: in an Euclidean feature space, genuine fingerprints tend to cluster in a specific pattern. First, samples from the same finger are close to each other, then samples from other fingers of the same user and finally, samples from fingers of other users. This property is statistically verified in our previous publication, and further confirmed in this paper. It is independent of the user population and the feature set class, which can be handcrafted or deep network-based (embeddings). Therefore, the add-on can be designed without the need for the targeted user samples; moreover, it exploits her/his samples' "closeness" property during the verification stage. Extensive experiments on benchmark datasets and state-of-the-art PAD methods confirm the benefits of the proposed add-on, which can be easily coupled with the main PAD module integrated into the fingerprint verification system.
Authors:Juan E. Tapia, Naser Damer, Christoph Busch, Juan M. Espin, Javier Barrachina, Alvaro S. Rocamora, Kristof Ocvirk, Leon Alessio, Borut Batagelj, Sushrut Patwardhan, Raghavendra Ramachandra, Raghavendra Mudgalgundurao, Kiran Raja, Daniel Schulz, Carlos Aravena
Title: First Competition on Presentation Attack Detection on ID Card
Abstract:
This paper summarises the Competition on Presentation Attack Detection on ID Cards (PAD-IDCard) held at the 2024 International Joint Conference on Biometrics (IJCB2024). The competition attracted a total of ten registered teams, both from academia and industry. In the end, the participating teams submitted five valid submissions, with eight models to be evaluated by the organisers. The competition presented an independent assessment of current state-of-the-art algorithms. Today, no independent evaluation on cross-dataset is available; therefore, this work determined the state-of-the-art on ID cards. To reach this goal, a sequestered test set and baseline algorithms were used to evaluate and compare all the proposals. The sequestered test dataset contains ID cards from four different countries. In summary, a team that chose to be "Anonymous" reached the best average ranking results of 74.80%, followed very closely by the "IDVC" team with 77.65%.
Authors:Zuying Xie, Changtao Miao, Ajian Liu, Jiabao Guo, Feng Li, Dan Guo, Yunfeng Diao
Title: SUEDE:Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement
Abstract:
Face recognition systems are vulnerable to physical attacks (e.g., printed photos) and digital threats (e.g., DeepFake), which are currently being studied as independent visual tasks, such as Face Anti-Spoofing and Forgery Detection. The inherent differences among various attack types present significant challenges in identifying a common feature space, making it difficult to develop a unified framework for detecting data from both attack modalities simultaneously. Inspired by the efficacy of Mixture-of-Experts (MoE) in learning across diverse domains, we explore utilizing multiple experts to learn the distinct features of various attack types. However, the feature distributions of physical and digital attacks overlap and differ. This suggests that relying solely on distinct experts to learn the unique features of each attack type may overlook shared knowledge between them. To address these issues, we propose SUEDE, the Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement. SUEDE combines a shared expert (always activated) to capture common features for both attack types and multiple routed experts (selectively activated) for specific attack types. Further, we integrate CLIP as the base network to ensure the shared expert benefits from prior visual knowledge and align visual-text representations in a unified space. Extensive results demonstrate SUEDE achieves superior performance compared to state-of-the-art unified detection methods.
Authors:Alain Komaty, Hatef Otroshi Shahreza, Anjith George, Sebastien Marcel
Title: Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning
Abstract:
This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD), outperforming several PAD models, including commercial solutions, in specific scenarios. Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning, where its performance improves as more examples are provided (reference data). We also observe that detailed prompts enable the model to provide scores reliably, a behavior not observed with concise prompts. Additionally, explanation-seeking prompts slightly enhance the model's performance by improving its interpretability. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios, despite not being explicitly instructed to classify attack types. Despite these strengths, GPT-4o faces challenges in zero-shot tasks, where its performance is limited compared to specialized PAD systems. Experiments were conducted on a subset of the SOTERIA dataset, ensuring compliance with data privacy regulations by using only data from consenting individuals. These findings underscore GPT-4o's promise in PAD applications, laying the groundwork for future research to address broader data privacy concerns and improve cross-dataset generalization. Code available here: https://gitlab.idiap.ch/bob/bob.paper.wacv2025_chatgpt_face_pad
Authors:Raghavendra Ramachandra, Sushma Venkatesh
Title: SpectraIrisPAD: Leveraging Vision Foundation Models for Spectrally Conditioned Multispectral Iris Presentation Attack Detection
Abstract:
Iris recognition is widely recognized as one of the most accurate biometric modalities. However, its growing deployment in real-world applications raises significant concerns regarding its vulnerability to Presentation Attacks (PAs). Effective Presentation Attack Detection (PAD) is therefore critical to ensure the integrity and security of iris-based biometric systems. While conventional iris recognition systems predominantly operate in the near-infrared (NIR) spectrum, multispectral imaging across multiple NIR bands provides complementary reflectance information that can enhance the generalizability of PAD methods. In this work, we propose \textbf{SpectraIrisPAD}, a novel deep learning-based framework for robust multispectral iris PAD. The SpectraIrisPAD leverages a DINOv2 Vision Transformer (ViT) backbone equipped with learnable spectral positional encoding, token fusion, and contrastive learning to extract discriminative, band-specific features that effectively distinguish bona fide samples from various spoofing artifacts. Furthermore, we introduce a new comprehensive dataset Multispectral Iris PAD (\textbf{MSIrPAD}) with diverse PAIs, captured using a custom-designed multispectral iris sensor operating at five distinct NIR wavelengths (800\,nm, 830\,nm, 850\,nm, 870\,nm, and 980\,nm). The dataset includes 18,848 iris images encompassing eight diverse PAI categories, including five textured contact lenses, print attacks, and display-based attacks. We conduct comprehensive experiments under unseen attack evaluation protocols to assess the generalization capability of the proposed method. SpectraIrisPAD consistently outperforms several state-of-the-art baselines across all performance metrics, demonstrating superior robustness and generalizability in detecting a wide range of presentation attacks.
Authors:Esteban M. Ruiz, Juan E. Tapia, Reinel T. Soto, Christoph Busch
Title: Identity Card Presentation Attack Detection: A Systematic Review
Abstract:
Remote identity verification is essential for modern digital security; however, it remains highly vulnerable to sophisticated Presentation Attacks (PAs) that utilise forged or manipulated identity documents. Although Deep Learning (DL) has driven advances in Presentation Attack Detection (PAD), the field is fundamentally limited by a lack of data and the poor generalisation of models across various document types and new attack methods. This article presents a systematic literature review (SLR) conducted in accordance with the PRISMA methodology, aiming to analyse and synthesise the current state of AI-based PAD for identity documents from 2020 to 2025 comprehensively. Our analysis reveals a significant methodological evolution: a transition from standard Convolutional Neural Networks (CNNs) to specialised forensic micro-artefact analysis, and more recently, the adoption of large-scale Foundation Models (FMs), marking a substantial shift in the field. We identify a central paradox that hinders progress: a critical "Reality Gap" exists between models validated on extensive, private datasets and those assessed using limited public datasets, which typically consist of mock-ups or synthetic data. This gap limits the reproducibility of research results. Additionally, we highlight a "Synthetic Utility Gap," where synthetic data generation the primary academic response to data scarcity often fails to predict forensic utility. This can lead to model overfitting to generation artefacts instead of the actual attack. This review consolidates our findings, identifies critical research gaps, and provides a definitive reference framework that outlines a prescriptive roadmap for future research aimed at developing secure, robust, and globally generalizable PAD systems.
Authors:Qingwen Zeng, Juan E. Tapia, Izan Garcia, Juan M. Espin, Christoph Busch
Title: ID-Card Synthetic Generation: Toward a Simulated Bona fide Dataset
Abstract:
Nowadays, the development of a Presentation Attack Detection (PAD) system for ID cards presents a challenge due to the lack of images available to train a robust PAD system and the increase in diversity of possible attack instrument species. Today, most algorithms focus on generating attack samples and do not take into account the limited number of bona fide images. This work is one of the first to propose a method for mimicking bona fide images by generating synthetic versions of them using Stable Diffusion, which may help improve the generalisation capabilities of the detector. Furthermore, the new images generated are evaluated in a system trained from scratch and in a commercial solution. The PAD system yields an interesting result, as it identifies our images as bona fide, which has a positive impact on detection performance and data restrictions.
Authors:Lazaro Janier Gonzalez-Sole, Juan E. Tapia, Christoph Busch
Title: Are Foundation Models All You Need for Zero-shot Face Presentation Attack Detection?
Abstract:
Although face recognition systems have undergone an impressive evolution in the last decade, these technologies are vulnerable to attack presentations (AP). These attacks are mostly easy to create and, by executing them against the system's capture device, the malicious actor can impersonate an authorised subject and thus gain access to the latter's information (e.g., financial transactions). To protect facial recognition schemes against presentation attacks, state-of-the-art deep learning presentation attack detection (PAD) approaches require a large amount of data to produce reliable detection performances and even then, they decrease their performance for unknown presentation attack instruments (PAI) or database (information not seen during training), i.e. they lack generalisability. To mitigate the above problems, this paper focuses on zero-shot PAD. To do so, we first assess the effectiveness and generalisability of foundation models in established and challenging experimental scenarios and then propose a simple but effective framework for zero-shot PAD. Experimental results show that these models are able to achieve performance in difficult scenarios with minimal effort of the more advanced PAD mechanisms, whose weights were optimised mainly with training sets that included APs and bona fide presentations. The top-performing foundation model outperforms by a margin the best from the state of the art observed with the leaving-one-out protocol on the SiW-Mv2 database, which contains challenging unknown 2D and 3D attacks
Authors:Juan E. Tapia, Christoph Busch
Title: Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards?
Abstract:
Nowadays, one of the main challenges in presentation attack detection (PAD) on ID cards is obtaining generalisation capabilities for a diversity of countries that are issuing ID cards. Most PAD systems are trained on one, two, or three ID documents because of privacy protection concerns. As a result, they do not obtain competitive results for commercial purposes when tested in an unknown new ID card country. In this scenario, Foundation Models (FM) trained on huge datasets can help to improve generalisation capabilities. This work intends to improve and benchmark the capabilities of FM and how to use them to adapt the generalisation on PAD of ID Documents. Different test protocols were used, considering zero-shot and fine-tuning and two different ID card datasets. One private dataset based on Chilean IDs and one open-set based on three ID countries: Finland, Spain, and Slovakia. Our findings indicate that bona fide images are the key to generalisation.
Authors:Juan E. Tapia, Fabian Stockhardt, Lázaro Janier González-Soler, Christoph Busch
Title: SynID: Passport Synthetic Dataset for Presentation Attack Detection
Abstract:
The demand for Presentation Attack Detection (PAD) to identify fraudulent ID documents in remote verification systems has significantly risen in recent years. This increase is driven by several factors, including the rise of remote work, online purchasing, migration, and advancements in synthetic images. Additionally, we have noticed a surge in the number of attacks aimed at the enrolment process. Training a PAD to detect fake ID documents is very challenging because of the limited number of ID documents available due to privacy concerns. This work proposes a new passport dataset generated from a hybrid method that combines synthetic data and open-access information using the ICAO requirement to obtain realistic training and testing images.
Authors:Yante Li, Hanwen Qi, Haoyu Chen, Xinlian Liang, Guoying Zhao
Title: Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection
Abstract:
In environmental protection, tree monitoring plays an essential role in maintaining and improving ecosystem health. However, precise monitoring is challenging because existing datasets fail to capture continuous fine-grained changes in trees due to low-resolution images and high acquisition costs. In this paper, we introduce UAVTC, a large-scale, long-term, high-resolution dataset collected using UAVs equipped with cameras, specifically designed to detect individual Tree Changes (TCs). UAVTC includes rich annotations and statistics based on biological knowledge, offering a fine-grained view for tree monitoring. To address environmental influences and effectively model the hierarchical diversity of physiological TCs, we propose a novel Hyperbolic Siamese Network (HSN) for TC detection, enabling compact and hierarchical representations of dynamic tree changes. Extensive experiments show that HSN can effectively capture complex hierarchical changes and provide a robust solution for fine-grained TC detection. In addition, HSN generalizes well to cross-domain face anti-spoofing task, highlighting its broader significance in AI. We believe our work, combining ecological insights and interdisciplinary expertise, will benefit the community by offering a new benchmark and innovative AI technologies.
Authors:Juan E. Tapia, Lázaro Janier González-Soler, Christoph Busch
Title: Towards Iris Presentation Attack Detection with Foundation Models
Abstract:
Foundation models are becoming increasingly popular due to their strong generalization capabilities resulting from being trained on huge datasets. These generalization capabilities are attractive in areas such as NIR Iris Presentation Attack Detection (PAD), in which databases are limited in the number of subjects and diversity of attack instruments, and there is no correspondence between the bona fide and attack images because, most of the time, they do not belong to the same subjects. This work explores an iris PAD approach based on two foundation models, DinoV2 and VisualOpenClip. The results show that fine-tuning prediction with a small neural network as head overpasses the state-of-the-art performance based on deep learning approaches. However, systems trained from scratch have still reached better results if bona fide and attack images are available.
Authors:Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal
Title: Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems
Abstract:
Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high-security sectors such as banking and healthcare. Despite their improvements using deep learning, they face severe vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti-spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in imperceptible zones to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state-of-the-art (SOTA) models across multiple tasks, under simulated real-world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next-generation defenses that employ dynamic, context-aware frameworks capable of evolving with the threat landscape.
Authors:Yihan Wu, Jee-weon Jung, Hye-jin Shim, Xin Cheng, Xin Wang
Title: WildSpoof Challenge Evaluation Plan
Abstract:
The WildSpoof Challenge aims to advance the use of in-the-wild data in two intertwined speech processing tasks. It consists of two parallel tracks: (1) Text-to-Speech (TTS) synthesis for generating spoofed speech, and (2) Spoofing-robust Automatic Speaker Verification (SASV) for detecting spoofed speech. While the organizers coordinate both tracks and define the data protocols, participants treat them as separate and independent tasks. The primary objectives of the challenge are: (i) to promote the use of in-the-wild data for both TTS and SASV, moving beyond conventional clean and controlled datasets and considering real-world scenarios; and (ii) to encourage interdisciplinary collaboration between the spoofing generation (TTS) and spoofing detection (SASV) communities, thereby fostering the development of more integrated, robust, and realistic systems.
Authors:Kamel Kamel, Keshav Sood, Hridoy Sankar Dutta, Sunil Aryal
Title: A Survey of Threats Against Voice Authentication and Anti-Spoofing Systems
Abstract:
Voice authentication has undergone significant changes from traditional systems that relied on handcrafted acoustic features to deep learning models that can extract robust speaker embeddings. This advancement has expanded its applications across finance, smart devices, law enforcement, and beyond. However, as adoption has grown, so have the threats. This survey presents a comprehensive review of the modern threat landscape targeting Voice Authentication Systems (VAS) and Anti-Spoofing Countermeasures (CMs), including data poisoning, adversarial, deepfake, and adversarial spoofing attacks. We chronologically trace the development of voice authentication and examine how vulnerabilities have evolved in tandem with technological advancements. For each category of attack, we summarize methodologies, highlight commonly used datasets, compare performance and limitations, and organize existing literature using widely accepted taxonomies. By highlighting emerging risks and open challenges, this survey aims to support the development of more secure and resilient voice authentication systems.
Authors:Jos Wigchert, Savio Sciancalepore, Gabriele Oligeri
Title: Detection of Aerial Spoofing Attacks to LEO Satellite Systems via Deep Learning
Abstract:
Detecting spoofing attacks to Low-Earth-Orbit (LEO) satellite systems is a cornerstone to assessing the authenticity of the received information and guaranteeing robust service delivery in several application domains. The solutions available today for spoofing detection either rely on additional communication systems, receivers, and antennas, or require mobile deployments. Detection systems working at the Physical (PHY) layer of the satellite communication link also require time-consuming and energy-hungry training processes on all satellites of the constellation, and rely on the availability of spoofed data, which are often challenging to collect. Moreover, none of such contributions investigate the feasibility of aerial spoofing attacks launched via drones operating at various altitudes. In this paper, we propose a new spoofing detection technique for LEO satellite constellation systems, applying anomaly detection on the received PHY signal via autoencoders. We validate our solution through an extensive measurement campaign involving the deployment of an actual spoofer (Software-Defined Radio) installed on a drone and injecting rogue IRIDIUM messages while flying at different altitudes with various movement patterns. Our results demonstrate that the proposed technique can reliably detect LEO spoofing attacks launched at different altitudes, while state-of-the-art competing approaches simply fail. We also release the collected data as open source, fostering further research on satellite security.
Authors:Tong Zhang, Yihuan Huang, Yanzhen Ren
Title: EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
Abstract:
The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks-a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.
Authors:Awais Khan, Ijaz Ul Haq, Khalid Mahmood Malik
Title: Parallel Stacked Aggregated Network for Voice Authentication in IoT-Enabled Smart Devices
Abstract:
Voice authentication on IoT-enabled smart devices has gained prominence in recent years due to increasing concerns over user privacy and security. The current authentication systems are vulnerable to different voice-spoofing attacks (e.g., replay, voice cloning, and audio deepfakes) that mimic legitimate voices to deceive authentication systems and enable fraudulent activities (e.g., impersonation, unauthorized access, financial fraud, etc.). Existing solutions are often designed to tackle a single type of attack, leading to compromised performance against unseen attacks. On the other hand, existing unified voice anti-spoofing solutions, not designed specifically for IoT, possess complex architectures and thus cannot be deployed on IoT-enabled smart devices. Additionally, most of these unified solutions exhibit significant performance issues, including higher equal error rates or lower accuracy for specific attacks. To overcome these issues, we present the parallel stacked aggregation network (PSA-Net), a lightweight framework designed as an anti-spoofing defense system for voice-controlled smart IoT devices. The PSA-Net processes raw audios directly and eliminates the need for dataset-dependent handcrafted features or pre-computed spectrograms. Furthermore, PSA-Net employs a split-transform-aggregate approach, which involves the segmentation of utterances, the extraction of intrinsic differentiable embeddings through convolutions, and the aggregation of them to distinguish legitimate from spoofed audios. In contrast to existing deep Resnet-oriented solutions, we incorporate cardinality as an additional dimension in our network, which enhances the PSA-Net ability to generalize across diverse attacks. The results show that the PSA-Net achieves more consistent performance for different attacks that exist in current anti-spoofing solutions.
Authors:Aref Farhadipour, Jan Marquenie, Srikanth Madikeri, Eleanor Chodroff
Title: TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice
Abstract:
The development of robust, multilingual speaker recognition systems is hindered by a lack of large-scale, publicly available and multilingual datasets, particularly for the read-speech style crucial for applications like anti-spoofing. To address this gap, we introduce the TidyVoice dataset derived from the Mozilla Common Voice corpus after mitigating its inherent speaker heterogeneity within the provided client IDs. TidyVoice currently contains training and test data from over 212,000 monolingual speakers (Tidy-M) and around 4,500 multilingual speakers (Tidy-X) from which we derive two distinct conditions. The Tidy-M condition contains target and non-target trials from monolingual speakers across 81 languages. The Tidy-X condition contains target and non-target trials from multilingual speakers in both same- and cross-language trials. We employ two architectures of ResNet models, achieving a 0.35% EER by fine-tuning on our comprehensive Tidy-M partition. Moreover, we show that this fine-tuning enhances the model's generalization, improving performance on unseen conversational interview data from the CANDOR corpus. The complete dataset, evaluation trials, and our models are publicly released to provide a new resource for the community.
Authors:Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Jun-Ren Chen, Cheng-Hsiang Su, Yi-Chen Ou, Chiou-Ting Hsu, Pei-Kai Huang
Title: Robust Multi-Modal Face Anti-Spoofing with Domain Adaptation: Tackling Missing Modalities, Noisy Pseudo-Labels, and Model Degradation
Abstract:
Recent multi-modal face anti-spoofing (FAS) methods have investigated the potential of leveraging multiple modalities to distinguish live and spoof faces. However, pre-adapted multi-modal FAS models often fail to detect unseen attacks from new target domains. Although a more realistic domain adaptation (DA) scenario has been proposed for single-modal FAS to learn specific spoof attacks during inference, DA remains unexplored in multi-modal FAS methods. In this paper, we propose a novel framework, MFAS-DANet, to address three major challenges in multi-modal FAS under the DA scenario: missing modalities, noisy pseudo labels, and model degradation. First, to tackle the issue of missing modalities, we propose extracting complementary features from other modalities to substitute missing modality features or enhance existing ones. Next, to reduce the impact of noisy pseudo labels during model adaptation, we propose deriving reliable pseudo labels by leveraging prediction uncertainty across different modalities. Finally, to prevent model degradation, we design an adaptive mechanism that decreases the loss weight during unstable adaptations and increasing it during stable ones. Extensive experiments demonstrate the effectiveness and state-of-the-art performance of our proposed MFAS-DANet.
Authors:Jun-Xiong Chong, Fang-Yu Hsu, Ming-Tsung Hsu, Yi-Ting Lin, Kai-Heng Chien, Chiou-Ting Hsu, Pei-Kai Huang
Title: Multi-Modal Face Anti-Spoofing via Cross-Modal Feature Transitions
Abstract:
Multi-modal face anti-spoofing (FAS) aims to detect genuine human presence by extracting discriminative liveness cues from multiple modalities, such as RGB, infrared (IR), and depth images, to enhance the robustness of biometric authentication systems. However, because data from different modalities are typically captured by various camera sensors and under diverse environmental conditions, multi-modal FAS often exhibits significantly greater distribution discrepancies across training and testing domains compared to single-modal FAS. Furthermore, during the inference stage, multi-modal FAS confronts even greater challenges when one or more modalities are unavailable or inaccessible. In this paper, we propose a novel Cross-modal Transition-guided Network (CTNet) to tackle the challenges in the multi-modal FAS task. Our motivation stems from that, within a single modality, the visual differences between live faces are typically much smaller than those of spoof faces. Additionally, feature transitions across modalities are more consistent for the live class compared to those between live and spoof classes. Upon this insight, we first propose learning consistent cross-modal feature transitions among live samples to construct a generalized feature space. Next, we introduce learning the inconsistent cross-modal feature transitions between live and spoof samples to effectively detect out-of-distribution (OOD) attacks during inference. To further address the issue of missing modalities, we propose learning complementary infrared (IR) and depth features from the RGB modality as auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet outperforms previous two-class multi-modal FAS methods across most protocols.
Authors:Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Hao-Chiang Shao, Chiou-Ting Hsu
Title: Unsupervised Feature Disentanglement and Augmentation Network for One-class Face Anti-spoofing
Abstract:
Face anti-spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two-class FAS methods risk overfitting to training attacks to achieve better performance, one-class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (\textbf{UFDANet}), a one-class FAS technique that enhances generalizability by augmenting face images via disentangled features. The \textbf{UFDANet} employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out-of-distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, \textbf{UFDANet} incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed \textbf{UFDANet} outperforms previous one-class FAS methods and achieves comparable performance to state-of-the-art two-class FAS methods.
Authors:Pei-Kai Huang, Jun-Xiong Chong, Cheng-Hsuan Chiang, Tzu-Hsien Chen, Tyng-Luh Liu, Chiou-Ting Hsu
Title: SLIP: Spoof-Aware One-Class Face Anti-Spoofing with Language Image Pretraining
Abstract:
Face anti-spoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision-language pretrained (VLP) models, recent two-class FAS techniques have leveraged the advantages of using VLP guidance, while this potential remains unexplored in one-class FAS methods. The one-class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one-class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (e.g., facial content), causing performance degradation when tested with a new application domain. To address this issue, we propose a novel framework called Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof-attack-related objects (e.g., paper, or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language-guided spoof cue map estimation to enhance one-class FAS models by simulating whether the underlying faces are covered by attack-related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt-driven liveness feature disentanglement to alleviate live/spoof-irrelative domain variations by disentangling live/spoof-relevant and domain-dependent information. Finally, we design an effective augmentation strategy by fusing latent features from live images and spoof prompts to generate spoof-like image features and thus diversify latent spoof features to facilitate the learning of one-class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one-class FAS methods.
Authors:Asish Bera, Debotosh Bhattacharjee, Hubert P H Shum
Title: Two-Stage Human Verification using HandCAPTCHA and Anti-Spoofed Finger Biometrics with Feature Selection
Abstract:
This paper presents a human verification scheme in two independent stages to overcome the vulnerabilities of attacks and to enhance security. At the first stage, a hand image-based CAPTCHA (HandCAPTCHA) is tested to avert automated bot-attacks on the subsequent biometric stage. In the next stage, finger biometric verification of a legitimate user is performed with presentation attack detection (PAD) using the real hand images of the person who has passed a random HandCAPTCHA challenge. The electronic screen-based PAD is tested using image quality metrics. After this spoofing detection, geometric features are extracted from the four fingers (excluding the thumb) of real users. A modified forward-backward (M-FoBa) algorithm is devised to select relevant features for biometric authentication. The experiments are performed on the Bogazici University (BU) and the IIT-Delhi (IITD) hand databases using the k-nearest neighbor and random forest classifiers. The average accuracy of the correct HandCAPTCHA solution is 98.5%, and the false accept rate of a bot is 1.23%. The PAD is tested on 255 subjects of BU, and the best average error is 0%. The finger biometric identification accuracy of 98% and an equal error rate (EER) of 6.5% have been achieved for 500 subjects of the BU. For 200 subjects of the IITD, 99.5% identification accuracy, and 5.18% EER are obtained.
Authors:Surya Subramani, Hashim Ali, Hafiz Malik
Title: LJ-Spoof: A Generatively Varied Corpus for Audio Anti-Spoofing and Synthesis Source Tracing
Abstract:
Speaker-specific anti-spoofing and synthesis-source tracing are central challenges in audio anti-spoofing. Progress has been hampered by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. To address this gap, we introduce LJ-Spoof, a speaker-specific, generatively diverse corpus that systematically varies prosody, vocoders, generative hyperparameters, bona fide prompt sources, training regimes, and neural post-processing. The corpus spans one speakers-including studio-quality recordings-30 TTS families, 500 generatively variant subsets, 10 bona fide neural-processing variants, and more than 3 million utterances. This variation-dense design enables robust speaker-conditioned anti-spoofing and fine-grained synthesis-source tracing. We further position this dataset as both a practical reference training resource and a benchmark evaluation suite for anti-spoofing and source tracing.
Authors:Hoan My Tran, Damien Lolive, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau, David Guennec
Title: Multi-level SSL Feature Gating for Audio Deepfake Detection
Abstract:
Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural-sounding synthetic speech that closely mimics human voices. While these innovations hold promise for applications like assistive technologies, they also pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited by generalization to unseen deepfake attacks and languages. To address this, we propose a gating mechanism extracting relevant feature from the speech foundation XLS-R model as a front-end feature extractor. For downstream back-end classifier, we employ Multi-kernel gated Convolution (MultiConv) to capture both local and global speech artifacts. Additionally, we introduce Centered Kernel Alignment (CKA) as a similarity metric to enforce diversity in learned features across different MultiConv layers. By integrating CKA with our gating mechanism, we hypothesize that each component helps improving the learning of distinct synthetic speech patterns. Experimental results demonstrate that our approach achieves state-of-the-art performance on in-domain benchmarks while generalizing robustly to out-of-domain datasets, including multilingual speech samples. This underscores its potential as a versatile solution for detecting evolving speech deepfake threats.
Authors:Oleksandr Kuznetsov, Emanuele Frontoni, Luca Romeo, Riccardo Rosati, Andrea Maranesi, Alessandro Muscatello
Title: Deep Learning Models for Robust Facial Liveness Detection
Abstract:
In the rapidly evolving landscape of digital security, biometric authentication systems, particularly facial recognition, have emerged as integral components of various security protocols. However, the reliability of these systems is compromised by sophisticated spoofing attacks, where imposters gain unauthorized access by falsifying biometric traits. Current literature reveals a concerning gap: existing liveness detection methodologies - designed to counteract these breaches - fall short against advanced spoofing tactics employing deepfakes and other artificial intelligence-driven manipulations. This study introduces a robust solution through novel deep learning models addressing the deficiencies in contemporary anti-spoofing techniques. By innovatively integrating texture analysis and reflective properties associated with genuine human traits, our models distinguish authentic presence from replicas with remarkable precision. Extensive evaluations were conducted across five diverse datasets, encompassing a wide range of attack vectors and environmental conditions. Results demonstrate substantial advancement over existing systems, with our best model (AttackNet V2.2) achieving 99.9% average accuracy when trained on combined data. Moreover, our research unveils critical insights into the behavioral patterns of impostor attacks, contributing to a more nuanced understanding of their evolving nature. The implications are profound: our models do not merely fortify the authentication processes but also instill confidence in biometric systems across various sectors reliant on secure access.
Authors:Colton R. Crum, Adam Czajka
Title: Almost Right: Making First-layer Kernels Nearly Orthogonal Improves Model Generalization
Abstract:
An ongoing research challenge within several domains in computer vision is how to increase model generalization capabilities. Several attempts to improve model generalization performance are heavily inspired by human perceptual intelligence, which is remarkable in both its performance and efficiency to generalize to unknown samples. Many of these methods attempt to force portions of the network to be orthogonal, following some observation within neuroscience related to early vision processes. In this paper, we propose a loss component that regularizes the filtering kernels in the first convolutional layer of a network to make them nearly orthogonal. Deviating from previous works, we give the network flexibility in which pairs of kernels it makes orthogonal, allowing the network to navigate to a better solution space, imposing harsh penalties. Without architectural modifications, we report substantial gains in generalization performance using the proposed loss against previous works (including orthogonalization- and saliency-based regularization methods) across three different architectures (ResNet-50, DenseNet-121, ViT-b-16) and two difficult open-set recognition tasks: presentation attack detection in iris biometrics, and anomaly detection in chest X-ray images.
Authors:Zhuowei Li, Tianchen Zhao, Xiang Xu, Zheng Zhang, Zhihua Li, Xuanbai Chen, Qin Zhang, Alessandro Bergamo, Anil K. Jain, Yifan Xing
Title: Optimal Transport-Guided Source-Free Adaptation for Face Anti-Spoofing
Abstract:
Developing a face anti-spoofing model that meets the security requirements of clients worldwide is challenging due to the domain gap between training datasets and diverse end-user test data. Moreover, for security and privacy reasons, it is undesirable for clients to share a large amount of their face data with service providers. In this work, we introduce a novel method in which the face anti-spoofing model can be adapted by the client itself to a target domain at test time using only a small sample of data while keeping model parameters and training data inaccessible to the client. Specifically, we develop a prototype-based base model and an optimal transport-guided adaptor that enables adaptation in either a lightweight training or training-free fashion, without updating base model's parameters. Furthermore, we propose geodesic mixup, an optimal transport-based synthesis method that generates augmented training data along the geodesic path between source prototypes and target data distribution. This allows training a lightweight classifier to effectively adapt to target-specific characteristics while retaining essential knowledge learned from the source domain. In cross-domain and cross-attack settings, compared with recent methods, our method achieves average relative improvements of 19.17% in HTER and 8.58% in AUC, respectively.
Authors:Daniyal Ganiuly, Nurzhau Bolatbek
Title: Slice-Aware Spoofing Detection in 5G Networks Using Lightweight Machine Learning
Abstract:
The increasing virtualization of fifth generation (5G) networks expands the attack surface of the user plane, making spoofing a persistent threat to slice integrity and service reliability. This study presents a slice-aware lightweight machine-learning framework for detecting spoofing attacks within 5G network slices. The framework was implemented on a reproducible Open5GS and srsRAN testbed emulating three service classes such as enhanced Mobile Broadband (eMBB), Ultra-Reliable Low-Latency Communication (URLLC), and massive Machine-Type Communication (mMTC) under controlled benign and adversarial traffic. Two efficient classifiers, Logistic Regression and Random Forest, were trained independently for each slice using statistical flow features derived from mirrored user-plane traffic. Slice-aware training improved detection accuracy by up to 5% and achieved F1-scores between 0.93 and 0.96 while maintaining real-time operation on commodity edge hardware. The results demonstrate that aligning security intelligence with slice boundaries enhances detection reliability and preserves operational isolation, enabling practical deployment in 5G network-security environments. Conceptually, the work bridges network-security architecture and adaptive machine learning by showing that isolation-aware intelligence can achieve scalable, privacy-preserving spoofing defense without high computational cost.
Authors:Manasa, Sushrut Patwardhan, Narayan Vetrekar, Pavan Kumar, R. S. Gad, Raghavendra Ramachandra
Title: Introducing Nylon Face Mask Attacks: A Dataset for Evaluating Generalised Face Presentation Attack Detection
Abstract:
Face recognition systems are increasingly deployed across a wide range of applications, including smartphone authentication, access control, and border security. However, these systems remain vulnerable to presentation attacks (PAs), which can significantly compromise their reliability. In this work, we introduce a new dataset focused on a novel and realistic presentation attack instrument called Nylon Face Masks (NFMs), designed to simulate advanced 3D spoofing scenarios. NFMs are particularly concerning due to their elastic structure and photorealistic appearance, which enable them to closely mimic the victim's facial geometry when worn by an attacker. To reflect real-world smartphone-based usage conditions, we collected the dataset using an iPhone 11 Pro, capturing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four distinct presentation scenarios involving both humans and mannequins. We benchmark the dataset using five state-of-the-art PAD methods to evaluate their robustness under unseen attack conditions. The results demonstrate significant performance variability across methods, highlighting the challenges posed by NFMs and underscoring the importance of developing PAD techniques that generalise effectively to emerging spoofing threats.
Authors:Wenjie Liu, Panos Papadimitratos
Title: GNSS Spoofing Detection Based on Opportunistic Position Information
Abstract:
The limited or no protection for civilian Global Navigation Satellite System (GNSS) signals makes spoofing attacks relatively easy. With modern mobile devices often featuring network interfaces, state-of-the-art signals of opportunity (SOP) schemes can provide accurate network positions in replacement of GNSS. The use of onboard inertial sensors can also assist in the absence of GNSS, possibly in the presence of jammers. The combination of SOP and inertial sensors has received limited attention, yet it shows strong results on fully custom-built platforms. We do not seek to improve such special-purpose schemes. Rather, we focus on countering GNSS attacks, notably detecting them, with emphasis on deployment with consumer-grade platforms, notably smartphones, that provide off-the-shelf opportunistic information (i.e., network position and inertial sensor data). Our Position-based Attack Detection Scheme (PADS) is a probabilistic framework that uses regression and uncertainty analysis for positions. The regression optimization problem is a weighted mean square error of polynomial fitting, with constraints that the fitted positions satisfy the device velocity and acceleration. Then, uncertainty is modeled by a Gaussian process, which provides more flexibility to analyze how sure or unsure we are about position estimations. In the detection process, we combine all uncertainty information with the position estimations into a fused test statistic, which is the input utilized by an anomaly detector based on outlier ensembles. The evaluation shows that the PADS outperforms a set of baseline methods that rely on SOP or inertial sensor-based or statistical tests, achieving up to 3 times the true positive rate at a low false positive rate.
Authors:Honglu Zhang, Zhiqin Fang, Ningning Zhao, Saihui Hou, Long Ma, Renwang Pei, Zhaofeng He
Title: FaceCoT: A Benchmark Dataset for Face Anti-Spoofing with Chain-of-Thought Reasoning
Abstract:
Face Anti-Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image-text understanding and semantic reasoning, suggesting that integrating visual and linguistic co-inference into FAS can substantially improve both robustness and interpretability. However, the lack of a high-quality vision-language multimodal dataset has been a critical bottleneck. To address this, we introduce FaceCoT (Face Chain-of-Thought), the first large-scale Visual Question Answering (VQA) dataset tailored for FAS. FaceCoT covers 14 spoofing attack types and enriches model learning with high-quality CoT VQA annotations. Meanwhile, we develop a caption model refined via reinforcement learning to expand the dataset and enhance annotation quality. Furthermore, we introduce a CoT-Enhanced Progressive Learning (CEPL) strategy to better leverage the CoT data and boost model performance on FAS tasks. Extensive experiments demonstrate that models trained with FaceCoT and CEPL outperform state-of-the-art methods on multiple benchmark datasets.
Authors:Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, Arun Ross
Title: Benchmarking Foundation Models for Zero-Shot Biometric Tasks
Abstract:
The advent of foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success. For example, in the case of face verification, a True Match Rate (TMR) of 96.77 percent was obtained at a False Match Rate (FMR) of 1 percent on the Labeled Face in the Wild (LFW) dataset, without any fine-tuning. In the case of iris recognition, the TMR at 1 percent FMR on the IITD-R-Full dataset was 97.55 percent without any fine-tuning. Further, we show that applying a simple classifier head to these embeddings can help perform DeepFake detection for faces, Presentation Attack Detection (PAD) for irides, and extract soft biometric attributes like gender and ethnicity from faces with reasonably high accuracy. This work reiterates the potential of pretrained models in achieving the long-term vision of Artificial General Intelligence.
Authors:Wenjie Liu, Panos Papadimitratos
Title: Self-Supervised Federated GNSS Spoofing Detection with Opportunistic Data
Abstract:
Global navigation satellite systems (GNSS) are vulnerable to spoofing attacks, with adversarial signals manipulating the location or time information of receivers, potentially causing severe disruptions. The task of discerning the spoofing signals from benign ones is naturally relevant for machine learning, thus recent interest in applying it for detection. While deep learning-based methods are promising, they require extensive labeled datasets, consume significant computational resources, and raise privacy concerns due to the sensitive nature of position data. This is why this paper proposes a self-supervised federated learning framework for GNSS spoofing detection. It consists of a cloud server and local mobile platforms. Each mobile platform employs a self-supervised anomaly detector using long short-term memory (LSTM) networks. Labels for training are generated locally through a spoofing-deviation prediction algorithm, ensuring privacy. Local models are trained independently, and only their parameters are uploaded to the cloud server, which aggregates them into a global model using FedAvg. The updated global model is then distributed back to the mobile platforms and trained iteratively. The evaluation shows that our self-supervised federated learning framework outperforms position-based and deep learning-based methods in detecting spoofing attacks while preserving data privacy.
Authors:Pei-Kai Huanga, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Chiou-Ting Hsu
Title: Enhancing Learnable Descriptive Convolutional Vision Transformer for Face Anti-Spoofing
Abstract:
Face anti-spoofing (FAS) heavily relies on identifying live/spoof discriminative features to counter face presentation attacks. Recently, we proposed LDCformer to successfully incorporate the Learnable Descriptive Convolution (LDC) into ViT, to model long-range dependency of locally descriptive features for FAS. In this paper, we propose three novel training strategies to effectively enhance the training of LDCformer to largely boost its feature characterization capability. The first strategy, dual-attention supervision, is developed to learn fine-grained liveness features guided by regional live/spoof attentions. The second strategy, self-challenging supervision, is designed to enhance the discriminability of the features by generating challenging training data. In addition, we propose a third training strategy, transitional triplet mining strategy, through narrowing the cross-domain gap while maintaining the transitional relationship between live and spoof features, to enlarge the domain-generalization capability of LDCformer. Extensive experiments show that LDCformer under joint supervision of the three novel training strategies outperforms previous methods.
Authors:Tore Johansson, Marco Spanghero, Panos Papadimitratos
Title: Consumer INS Coupled with Carrier Phase Measurements for GNSS Spoofing Detection
Abstract:
Global Navigation Satellite Systems enable precise localization and timing even for highly mobile devices, but legacy implementations provide only limited support for the new generation of security-enhanced signals. Inertial Measurement Units have proved successful in augmenting the accuracy and robustness of the GNSS-provided navigation solution, but effective navigation based on inertial techniques in denied contexts requires high-end sensors. However, commercially available mobile devices usually embed a much lower-grade inertial system. To counteract an attacker transmitting all the adversarial signals from a single antenna, we exploit carrier phase-based observations coupled with a low-end inertial sensor to identify spoofing and meaconing. By short-time integration with an inertial platform, which tracks the displacement of the GNSS antenna, the high-frequency movement at the receiver is correlated with the variation in the carrier phase. In this way, we identify legitimate transmitters, based on their geometrical diversity with respect to the antenna system movement. We introduce a platform designed to effectively compare different tiers of commercial INS platforms with a GNSS receiver. By characterizing different inertial sensors, we show that simple MEMS INS perform as well as high-end industrial-grade sensors. Sensors traditionally considered unsuited for navigation purposes offer great performance at the short integration times used to evaluate the carrier phase information consistency against the high-frequency movement. Results from laboratory evaluation and through field tests at Jammertest 2024 show that the detector is up to 90% accurate in correctly identifying spoofing (or the lack of it), without any modification to the receiver structure, and with mass-production grade INS typical for mobile phones.
Authors:Junyi Peng, Ladislav Mošner, Lin Zhang, Oldřich Plchot, Themos Stafylakis, Lukáš Burget, Jan Černocký
Title: CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification
Abstract:
Self-supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL-based SV systems often struggle to capture local temporal dependencies and generalize across different tasks. In this paper, we propose context-aware multi-head factorized attentive pooling (CA-MHFA), a lightweight framework that incorporates contextual information from surrounding frames. CA-MHFA leverages grouped, learnable queries to effectively model contextual dependencies while maintaining efficiency by sharing keys and values across groups. Experimental results on the VoxCeleb dataset show that CA-MHFA achieves EERs of 0.42\%, 0.48\%, and 0.96\% on Vox1-O, Vox1-E, and Vox1-H, respectively, outperforming complex models like WavLM-TDNN with fewer parameters and faster convergence. Additionally, CA-MHFA demonstrates strong generalization across multiple SSL models and tasks, including emotion recognition and anti-spoofing, highlighting its robustness and versatility.
Authors:Deepak Kumar Panda, Weisi Guo
Title: Real-Time Bayesian Detection of Drift-Evasive GNSS Spoofing in Reinforcement Learning Based UAV Deconfliction
Abstract:
Autonomous unmanned aerial vehicles (UAVs) rely on global navigation satellite system (GNSS) pseudorange measurements for accurate real-time localization and navigation. However, this dependence exposes them to sophisticated spoofing threats, where adversaries manipulate pseudoranges to deceive UAV receivers. Among these, drift-evasive spoofing attacks subtly perturb measurements, gradually diverting the UAVs trajectory without triggering conventional signal-level anti-spoofing mechanisms. Traditional distributional shift detection techniques often require accumulating a threshold number of samples, causing delays that impede rapid detection and timely response. Consequently, robust temporal-scale detection methods are essential to identify attack onset and enable contingency planning with alternative sensing modalities, improving resilience against stealthy adversarial manipulations. This study explores a Bayesian online change point detection (BOCPD) approach that monitors temporal shifts in value estimates from a reinforcement learning (RL) critic network to detect subtle behavioural deviations in UAV navigation. Experimental results show that this temporal value-based framework outperforms conventional GNSS spoofing detectors, temporal semi-supervised learning frameworks, and the Page-Hinkley test, achieving higher detection accuracy and lower false-positive and false-negative rates for drift-evasive spoofing attacks.
Authors:Samuel Webster, Adam Czajka
Title: Saliency-Guided Training for Fingerprint Presentation Attack Detection
Abstract:
Saliency-guided training, which directs model learning to important regions of images, has demonstrated generalization improvements across various biometric presentation attack detection (PAD) tasks. This paper presents its first application to fingerprint PAD. We conducted a 50-participant study to create a dataset of 800 human-annotated fingerprint perceptually-important maps, explored alongside algorithmically-generated "pseudosaliency," including minutiae-based, image quality-based, and autoencoder-based saliency maps. Evaluating on the 2021 Fingerprint Liveness Detection Competition testing set, we explore various configurations within five distinct training scenarios to assess the impact of saliency-guided training on accuracy and generalization. Our findings demonstrate the effectiveness of saliency-guided training for fingerprint PAD in both limited and large data contexts, and we present a configuration capable of earning the first place on the LivDet-2021 benchmark. Our results highlight saliency-guided training's promise for increased model generalization capabilities, its effectiveness when data is limited, and its potential to scale to larger datasets in fingerprint PAD. All collected saliency data and trained models are released with the paper to support reproducible research.
Authors:Alvaro S. Rocamora, Juan M. Espin, Juan E. Tapia
Title: Few-Shot Learning: Expanding ID Cards Presentation Attack Detection to Unknown ID Countries
Abstract:
This paper proposes a Few-shot Learning (FSL) approach for detecting Presentation Attacks on ID Cards deployed in a remote verification system and its extension to new countries. Our research analyses the performance of Prototypical Networks across documents from Spain and Chile as a baseline and measures the extension of generalisation capabilities of new ID Card countries such as Argentina and Costa Rica. Specifically targeting the challenge of screen display presentation attacks. By leveraging convolutional architectures and meta-learning principles embodied in Prototypical Networks, we have crafted a model that demonstrates high efficacy with Few-shot examples. This research reveals that competitive performance can be achieved with as Few-shots as five unique identities and with under 100 images per new country added. This opens a new insight for novel generalised Presentation Attack Detection on ID cards to unknown attacks.
Authors:Md Sahidullah, Hye-jin Shim, Rosa Gonzalez Hautamäki, Tomi H. Kinnunen
Title: Shortcut Learning in Binary Classifier Black Boxes: Applications to Voice Anti-Spoofing and Biometrics
Abstract:
The widespread adoption of deep-learning models in data-driven applications has drawn attention to the potential risks associated with biased datasets and models. Neglected or hidden biases within datasets and models can lead to unexpected results. This study addresses the challenges of dataset bias and explores ``shortcut learning'' or ``Clever Hans effect'' in binary classifiers. We propose a novel framework for analyzing the black-box classifiers and for examining the impact of both training and test data on classifier scores. Our framework incorporates intervention and observational perspectives, employing a linear mixed-effects model for post-hoc analysis. By evaluating classifier performance beyond error rates, we aim to provide insights into biased datasets and offer a comprehensive understanding of their influence on classifier behavior. The effectiveness of our approach is demonstrated through experiments on audio anti-spoofing and speaker verification tasks using both statistical models and deep neural networks. The insights gained from this study have broader implications for tackling biases in other domains and advancing the field of explainable artificial intelligence.
Authors:Mahsa Mitcheff, Siamul Karim Khan, Adam Czajka
Title: Gradient-Guided Exploration of Generative Model's Latent Space for Controlled Iris Image Augmentations
Abstract:
Developing reliable iris recognition and presentation attack detection methods requires diverse datasets that capture realistic variations in iris features and a wide spectrum of anomalies. Because of the rich texture of iris images, which spans a wide range of spatial frequencies, synthesizing same-identity iris images while controlling specific attributes remains challenging. In this work, we introduce a new iris image augmentation strategy by traversing a generative model's latent space toward latent codes that represent same-identity samples but with some desired iris image properties manipulated. The latent space traversal is guided by a gradient of specific geometrical, textural, or quality-related iris image features (e.g., sharpness, pupil size, iris size, or pupil-to-iris ratio) and preserves the identity represented by the image being manipulated. The proposed approach can be easily extended to manipulate any attribute for which a differentiable loss term can be formulated. Additionally, our approach can use either randomly generated images using either a pre-train GAN model or real-world iris images. We can utilize GAN inversion to project any given iris image into the latent space and obtain its corresponding latent code.
Authors:Rashik Shadman, M G Sarwar Murshed, Faraz Hussain
Title: Explainable Face Presentation Attack Detection via Ensemble-CAM
Abstract:
Presentation attacks represent a critical security threat where adversaries use fake biometric data, such as face, fingerprint, or iris images, to gain unauthorized access to protected systems. Various presentation attack detection (PAD) systems have been designed leveraging deep learning (DL) models to mitigate this type of threat. Despite their effectiveness, most of the DL models function as black boxes - their decisions are opaque to their users. The purpose of explainability techniques is to provide detailed information about the reason behind the behavior or decision of DL models. In particular, visual explanation is necessary to better understand the decisions or predictions of DL-based PAD systems and determine the key regions due to which a biometric image is considered real or fake by the system. In this work, a novel technique, Ensemble-CAM, is proposed for providing visual explanations for the decisions made by deep learning-based face PAD systems. Our goal is to improve DL-based face PAD systems by providing a better understanding of their behavior. Our provided visual explanations will enhance the transparency and trustworthiness of DL-based face PAD systems.
Authors:Jacob Piland, Chris Sweet, Adam Czajka
Title: Divisive Decisions: Improving Salience-Based Training for Generalization in Binary Classification Tasks
Abstract:
Existing saliency-guided training approaches improve model generalization by incorporating a loss term that compares the model's class activation map (CAM) for a sample's true-class ({\it i.e.}, correct-label class) against a human reference saliency map. However, prior work has ignored the false-class CAM(s), that is the model's saliency obtained for incorrect-label class. We hypothesize that in binary tasks the true and false CAMs should diverge on the important classification features identified by humans (and reflected in human saliency maps). We use this hypothesis to motivate three new saliency-guided training methods incorporating both true- and false-class model's CAM into the training strategy and a novel post-hoc tool for identifying important features. We evaluate all introduced methods on several diverse binary close-set and open-set classification tasks, including synthetic face detection, biometric presentation attack detection, and classification of anomalies in chest X-ray scans, and find that the proposed methods improve generalization capabilities of deep learning models over traditional (true-class CAM only) saliency-guided training approaches. We offer source codes and model weights\footnote{GitHub repository link removed to preserve anonymity} to support reproducible research.
Authors:Thanapat Trachu, Thanathai Lertpetchpun, Ekapol Chuangsuwanich
Title: Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofing
Abstract:
Spoofed utterances always contain artifacts introduced by generative models. While several countermeasures have been proposed to detect spoofed utterances, most primarily focus on architectural improvements. In this work, we investigate how artifacts remain hidden in spoofed speech and how to enhance their presence. We propose a model-agnostic pipeline that amplifies artifacts using speech enhancement and various types of noise. Our approach consists of three key steps: noise addition, noise extraction, and noise amplification. First, we introduce noise into the raw speech. Then, we apply speech enhancement to extract the entangled noise and artifacts. Finally, we amplify these extracted features. Moreover, our pipeline is compatible with different speech enhancement models and countermeasure architectures. Our method improves spoof detection performance by up to 44.44\% on ASVspoof2019 and 26.34\% on ASVspoof2021.
Authors:Banafsheh Adami, Nima Karimian
Title: GRU-AUNet: A Domain Adaptation Framework for Contactless Fingerprint Presentation Attack Detection
Abstract:
Although contactless fingerprints offer user comfort, they are more vulnerable to spoofing. The current solution for anti-spoofing in the area of contactless fingerprints relies on domain adaptation learning, limiting their generalization and scalability. To address these limitations, we introduce GRU-AUNet, a domain adaptation approach that integrates a Swin Transformer-based UNet architecture with GRU-enhanced attention mechanisms, a Dynamic Filter Network in the bottleneck, and a combined Focal and Contrastive Loss function. Trained in both genuine and spoof fingerprint images, GRU-AUNet demonstrates robust resilience against presentation attacks, achieving an average BPCER of 0.09\% and APCER of 1.2\% in the CLARKSON, COLFISPOOF, and IIITD datasets, outperforming state-of-the-art domain adaptation methods.
Authors:Manasi Chhibber, Jagabandhu Mishra, Hyejin Shim, Tomi H. Kinnunen
Title: An Explainable Probabilistic Attribute Embedding Approach for Spoofed Speech Characterization
Abstract:
We propose a novel approach for spoofed speech characterization through explainable probabilistic attribute embeddings. In contrast to high-dimensional raw embeddings extracted from a spoofing countermeasure (CM) whose dimensions are not easy to interpret, the probabilistic attributes are designed to gauge the presence or absence of sub-components that make up a specific spoofing attack. These attributes are then applied to two downstream tasks: spoofing detection and attack attribution. To enforce interpretability also to the back-end, we adopt a decision tree classifier. Our experiments on the ASVspoof2019 dataset with spoof CM embeddings extracted from three models (AASIST, Rawboost-AASIST, SSL-AASIST) suggest that the performance of the attribute embeddings are on par with the original raw spoof CM embeddings for both tasks. The best performance achieved with the proposed approach for spoofing detection and attack attribution, in terms of accuracy, is 99.7% and 99.2%, respectively, compared to 99.7% and 94.7% using the raw CM embeddings. To analyze the relative contribution of each attribute, we estimate their Shapley values. Attributes related to acoustic feature prediction, waveform generation (vocoder), and speaker modeling are found important for spoofing detection; while duration modeling, vocoder, and input type play a role in spoofing attack attribution.
Authors:Ji Hyuk Jung, Ji Won Yoon
Title: Neutralization of IMU-Based GPS Spoofing Detection using external IMU sensor and feedback methodology
Abstract:
Autonomous Vehicles (AVs) refer to systems capable of perceiving their states and moving without human intervention. Among the factors required for autonomous decision-making in mobility, positional awareness of the vehicle itself is the most critical. Accordingly, extensive research has been conducted on defense mechanisms against GPS spoofing attacks, which threaten AVs by disrupting position recognition. Among these, detection methods based on internal IMU sensors are regarded as some of the most effective. In this paper, we propose a spoofing attack system designed to neutralize IMU sensor-based detection. First, we present an attack modeling approach for bypassing such detection. Then, based on EKF sensor fusion, we experimentally analyze both the impact of GPS spoofing values on the internal target system and how our proposed methodology reduces anomaly detection within the target system. To this end, this paper proposes an attack model that performs GPS spoofing by stealing internal dynamic state information using an external IMU sensor, and the experimental results demonstrate that attack values can be injected without being detected.
Authors:Shivangi Yadav, Arun Ross
Title: A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection
Abstract:
An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
Authors:Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan
Title: Discrete optimal transport is a strong audio adversarial attack
Abstract:
In this paper, we show that discrete optimal transport (DOT) is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Evaluated on ASVspoof2019 and ASVspoof5 with AASIST baselines, DOT yields consistently high equal error rate (EER) across datasets and remains competitive after CM fine-tuning, outperforming several conventional attacks in cross-dataset transfer. Ablation analysis highlights the practical impact of vocoder overlap. Results indicate that distribution-level alignment is a powerful and stable attack surface for deployed CMs.
Authors:Anas Ali, Mubashar Husain, Peter Hans
Title: Intelligent ARP Spoofing Detection using Multi-layered Machine Learning (ML) Techniques for IoT Networks
Abstract:
Address Resolution Protocol (ARP) spoofing remains a critical threat to IoT networks, enabling attackers to intercept, modify, or disrupt data transmission by exploiting ARP's lack of authentication. The decentralized and resource-constrained nature of IoT environments amplifies this vulnerability, making conventional detection mechanisms ineffective at scale. This paper introduces an intelligent, multi-layered machine learning framework designed to detect ARP spoofing in real-time IoT deployments. Our approach combines feature engineering based on ARP header behavior, traffic flow analysis, and temporal packet anomalies with a hybrid detection pipeline incorporating decision trees, ensemble models, and deep learning classifiers. We propose a hierarchical architecture to prioritize lightweight models at edge gateways and deeper models at centralized nodes to balance detection accuracy and computational efficiency. The system is validated on both simulated IoT traffic and the CICIDS2017 dataset, achieving over 97% detection accuracy with low false positive rates. Comparative evaluations with signature-based and rule-based systems demonstrate the robustness and generalizability of our approach. Our results show that intelligent machine learning integration enables proactive ARP spoofing detection tailored for IoT scenarios, laying the groundwork for scalable and autonomous network security solutions.
Authors:Hailin Li, Raghavendra Ramachandra, Mohamed Ragab, Soumik Mondal, Yong Kiam Tan, Khin Mi Mi Aung
Title: Unsupervised Fingerphoto Presentation Attack Detection With Diffusion Models
Abstract:
Smartphone-based contactless fingerphoto authentication has become a reliable alternative to traditional contact-based fingerprint biometric systems owing to rapid advances in smartphone camera technology. Despite its convenience, fingerprint authentication through fingerphotos is more vulnerable to presentation attacks, which has motivated recent research efforts towards developing fingerphoto Presentation Attack Detection (PAD) techniques. However, prior PAD approaches utilized supervised learning methods that require labeled training data for both bona fide and attack samples. This can suffer from two key issues, namely (i) generalization:the detection of novel presentation attack instruments (PAIs) unseen in the training data, and (ii) scalability:the collection of a large dataset of attack samples using different PAIs. To address these challenges, we propose a novel unsupervised approach based on a state-of-the-art deep-learning-based diffusion model, the Denoising Diffusion Probabilistic Model (DDPM), which is trained solely on bona fide samples. The proposed approach detects Presentation Attacks (PA) by calculating the reconstruction similarity between the input and output pairs of the DDPM. We present extensive experiments across three PAI datasets to test the accuracy and generalization capability of our approach. The results show that the proposed DDPM-based PAD method achieves significantly better detection error rates on several PAI classes compared to other baseline unsupervised approaches.
Authors:Daniyar Kurmankhojayev, Andrei Shadrikov, Dmitrii Gordin, Mikhail Shkorin, Danijar Gabdullin, Aigerim Kambetbayeva, Kanat Kuatov
Title: Virtual camera detection: Catching video injection attacks in remote biometric systems
Abstract:
Face anti-spoofing (FAS) is a vital component of remote biometric authentication systems based on facial recognition, increasingly used across web-based applications. Among emerging threats, video injection attacks -- facilitated by technologies such as deepfakes and virtual camera software -- pose significant challenges to system integrity. While virtual camera detection (VCD) has shown potential as a countermeasure, existing literature offers limited insight into its practical implementation and evaluation. This study introduces a machine learning-based approach to VCD, with a focus on its design and validation. The model is trained on metadata collected during sessions with authentic users. Empirical results demonstrate its effectiveness in identifying video injection attempts and reducing the risk of malicious users bypassing FAS systems.
Authors:Vienna Li, Justin Villa, Dan Diessner, Jayson Clifford, Laxima Niure Kandel
Title: C/N0 Analysis-Based GPS Spoofing Detection with Variable Antenna Orientations
Abstract:
GPS spoofing poses a growing threat to aviation by falsifying satellite signals and misleading aircraft navigation systems. This paper demonstrates a proof-of-concept spoofing detection strategy based on analyzing satellite Carrier-to-Noise Density Ratio (C/N$_0$) variation during controlled static antenna orientations. Using a u-blox EVK-M8U receiver and a GPSG-1000 satellite simulator, C/N$_0$ data is collected under three antenna orientations flat, banked right, and banked left) in both real-sky (non-spoofed) and spoofed environments. Our findings reveal that under non-spoofed signals, C/N$_0$ values fluctuate naturally with orientation, reflecting true geometric dependencies. However, spoofed signals demonstrate a distinct pattern: the flat orientation, which directly faces the spoofing antenna, consistently yielded the highest C/N$_0$ values, while both banked orientations showed reduced C/N$_0$ due to misalignment with the spoofing source. These findings suggest that simple maneuvers such as brief banking to induce C/N$_0$ variations can provide early cues of GPS spoofing for general aviation and UAV systems.
Authors:Ali Ghanbarzade, Hossein Soleimani
Title: GNSS/GPS Spoofing and Jamming Identification Using Machine Learning and Deep Learning
Abstract:
The increasing reliance on Global Navigation Satellite Systems (GNSS), particularly the Global Positioning System (GPS), underscores the urgent need to safeguard these technologies against malicious threats such as spoofing and jamming. As the backbone for positioning, navigation, and timing (PNT) across various applications including transportation, telecommunications, and emergency services GNSS is vulnerable to deliberate interference that poses significant risks. Spoofing attacks, which involve transmitting counterfeit GNSS signals to mislead receivers into calculating incorrect positions, can result in serious consequences, from navigational errors in civilian aviation to security breaches in military operations. Furthermore, the lack of inherent security measures within GNSS systems makes them attractive targets for adversaries. While GNSS/GPS jamming and spoofing systems consist of numerous components, the ability to distinguish authentic signals from malicious ones is essential for maintaining system integrity. Recent advancements in machine learning and deep learning provide promising avenues for enhancing detection and mitigation strategies against these threats. This paper addresses both spoofing and jamming by tackling real-world challenges through machine learning, deep learning, and computer vision techniques. Through extensive experiments on two real-world datasets related to spoofing and jamming detection using advanced algorithms, we achieved state of the art results. In the GNSS/GPS jamming detection task, we attained approximately 99% accuracy, improving performance by around 5% compared to previous studies. Additionally, we addressed a challenging tasks related to spoofing detection, yielding results that underscore the potential of machine learning and deep learning in this domain.
Authors:Sana Hafeez, Ghulam E Mustafa Abro, Hifza Mustafa
Title: Quantum-Resilient Threat Modelling for Secure RIS-Assisted ISAC in 6G UAV Corridors
Abstract:
The rapid deployment of unmanned aerial vehicle (UAV) corridors in sixth-generation (6G) networks requires safe, intelligence-driven integrated sensing and communications (ISAC). Reconfigurable intelligent surfaces (RIS) enhance spectrum efficiency, localisation accuracy, and situational awareness, while introducing new vulnerabilities. The rise of quantum computing increases the risks associated with harvest-now-decrypt-later strategies and quantum-enhanced spoofing. We propose a Quantum-Resilient Threat Modelling (QRTM) framework for RIS-assisted ISAC in UAV corridors to address these challenges. QRTM integrates classical, quantum-ready, and quantum-aided adversaries, countered using post-quantum cryptographic (PQC) primitives: ML-KEM for key establishment and Falcon for authentication, both embedded within RIS control signalling and UAV coordination. To strengthen security sensing, the framework introduces RIS-coded scene watermarking validated through a generalised likelihood ratio test (GLRT), with its detection probability characterised by the Marcum Q function. Furthermore, a Secure ISAC Utility (SIU) jointly optimises secrecy rate, spoofing detection, and throughput under RIS constraints, enabled by a scheduler with computational complexity of O(n^2). Monte Carlo evaluations using 3GPP Release 19 mid-band urban-canyon models (7-15 GHz) demonstrate a spoof-detection probability approaching 0.99 at a false-alarm rate of 1e-3, secrecy-rate retention exceeding 90 percent against quantum-capable adversaries, and signal-interference utilisation improvements of about 25 percent compared with baselines. These results show a standards-compliant path towards reliable, quantum-resilient ISAC for UAV corridors in smart cities and non-terrestrial networks.
Authors:Jan Zelinka, Oliver Kost, Marek Hrúz
Title: Deep Sequence-to-Sequence Models for GNSS Spoofing Detection
Abstract:
We present a data generation framework designed to simulate spoofing attacks and randomly place attack scenarios worldwide. We apply deep neural network-based models for spoofing detection, utilizing Long Short-Term Memory networks and Transformer-inspired architectures. These models are specifically designed for online detection and are trained using the generated dataset. Our results demonstrate that deep learning models can accurately distinguish spoofed signals from genuine ones, achieving high detection performance. The best results are achieved by Transformer-inspired architectures with early fusion of the inputs resulting in an error rate of 0.16%.
Authors:Rongyu Yu, Kan Chen, Zeyu Deng, Chen Wang, Burak Kizilkaya, Liying Emma Li
Title: Haptic-Based User Authentication for Tele-robotic System
Abstract:
Tele-operated robots rely on real-time user behavior mapping for remote tasks, but ensuring secure authentication remains a challenge. Traditional methods, such as passwords and static biometrics, are vulnerable to spoofing and replay attacks, particularly in high-stakes, continuous interactions. This paper presents a novel anti-spoofing and anti-replay authentication approach that leverages distinctive user behavioral features extracted from haptic feedback during human-robot interactions. To evaluate our authentication approach, we collected a time-series force feedback dataset from 15 participants performing seven distinct tasks. We then developed a transformer-based deep learning model to extract temporal features from the haptic signals. By analyzing user-specific force dynamics, our method achieves over 90 percent accuracy in both user identification and task classification, demonstrating its potential for enhancing access control and identity assurance in tele-robotic systems.
Authors:Dimitrios Koutsianos, Stavros Zacharopoulos, Yannis Panagakis, Themos Stafylakis
Title: Synthetic Speech Source Tracing using Metric Learning
Abstract:
This paper addresses source tracing in synthetic speech-identifying generative systems behind manipulated audio via speaker recognition-inspired pipelines. While prior work focuses on spoofing detection, source tracing lacks robust solutions. We evaluate two approaches: classification-based and metric-learning. We tested our methods on the MLAADv5 benchmark using ResNet and self-supervised learning (SSL) backbones. The results show that ResNet achieves competitive performance with the metric learning approach, matching and even exceeding SSL-based systems. Our work demonstrates ResNet's viability for source tracing while underscoring the need to optimize SSL representations for this task. Our work bridges speaker recognition methodologies with audio forensic challenges, offering new directions for combating synthetic media manipulation.
Authors:Binh Nguyen, Shuji Shi, Ryan Ofman, Thai Le
Title: What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection
Abstract:
Recent advances in text-to-speech technologies have enabled realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior work has predominantly focused on acoustic-level perturbations, leaving the impact of linguistic variation largely unexplored. In this paper, we investigate the linguistic sensitivity of both open-source and commercial anti-spoofing detectors by introducing transcript-level adversarial attacks. Our extensive evaluation reveals that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates surpass 60% on several open-source detector-voice pairs, and notably one commercial detection accuracy drops from 100% on synthetic audio to just 32%. Through a comprehensive feature attribution analysis, we identify that both linguistic complexity and model-level audio embedding similarity contribute strongly to detector vulnerability. We further demonstrate the real-world risk via a case study replicating the Brad Pitt audio deepfake scam, using transcript adversarial attacks to completely bypass commercial detectors. These results highlight the need to move beyond purely acoustic defenses and account for linguistic variation in the design of robust anti-spoofing systems. All source code will be publicly available.
Authors:Najeebullah, Maaz Salman, Zar Nawab Khan Swati
Title: Face Spoofing Detection using Deep Learning
Abstract:
Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision based models, MobileNetV2, ResNET50, and Vision Transformer, ViT, for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training , 140,002, testing, 10,984, and validation ,39,574, sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2, and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2 balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security sensitive contexts and suggests MobileNetV2 as a practical solution for real world deployment.
Authors:Vedrana Krivokuca Hahn, Jeremy Maceiras, Alain Komaty, Philip Abbet, Sebastien Marcel
Title: in-Car Biometrics (iCarB) Datasets for Driver Recognition: Face, Fingerprint, and Voice
Abstract:
We present three biometric datasets (iCarB-Face, iCarB-Fingerprint, iCarB-Voice) containing face videos, fingerprint images, and voice samples, collected inside a car from 200 consenting volunteers. The data was acquired using a near-infrared camera, two fingerprint scanners, and two microphones, while the volunteers were seated in the driver's seat of the car. The data collection took place while the car was parked both indoors and outdoors, and different "noises" were added to simulate non-ideal biometric data capture that may be encountered in real-life driver recognition. Although the datasets are specifically tailored to in-vehicle biometric recognition, their utility is not limited to the automotive environment. The iCarB datasets, which are available to the research community, can be used to: (i) evaluate and benchmark face, fingerprint, and voice recognition systems (we provide several evaluation protocols); (ii) create multimodal pseudo-identities, to train/test multimodal fusion algorithms; (iii) create Presentation Attacks from the biometric data, to evaluate Presentation Attack Detection algorithms; (iv) investigate demographic and environmental biases in biometric systems, using the provided metadata. To the best of our knowledge, ours are the largest and most diverse publicly available in-vehicle biometric datasets. Most other datasets contain only one biometric modality (usually face), while our datasets consist of three modalities, all acquired in the same automotive environment. Moreover, iCarB-Fingerprint seems to be the first publicly available in-vehicle fingerprint dataset. Finally, the iCarB datasets boast a rare level of demographic diversity among the 200 data subjects, including a 50/50 gender split, skin colours across the whole Fitzpatrick-scale spectrum, and a wide age range (18-60+). So, these datasets will be valuable for advancing biometrics research.
Authors:Esteban Gómez, Tom Bäckström
Title: Performance and Complexity Trade-off Optimization of Speech Models During Training
Abstract:
In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.
Authors:Ashwin Poudel, Utsav Poudel, Dikshyanta Aryal, Anuj Nepal, Pranish Pathak, Subramaniyaswamy V
Title: A Quantum-Secure and Blockchain-Integrated E-Voting Framework with Identity Validation
Abstract:
The rapid growth of quantum computing poses a threat to the cryptographic foundations of digital systems, requiring the development of secure and scalable electronic voting (evoting) frameworks. We introduce a post-quantum-secure evoting architecture that integrates Falcon lattice-based digital signatures, biometric authentication via MobileNetV3 and AdaFace, and a permissioned blockchain for tamper-proof vote storage. Voter registration involves capturing facial embeddings, which are digitally signed using Falcon and stored on-chain to ensure integrity and non-repudiation. During voting, real-time biometric verification is performed using anti-spoofing techniques and cosine-similarity matching. The system demonstrates low latency and robust spoof detection, monitored through Prometheus and Grafana for real-time auditing. The average classification error rates (ACER) are below 3.5% on the CelebA Spoof dataset and under 8.2% on the Wild Face Anti-Spoofing (WFAS) dataset. Blockchain anchoring incurs minimal gas overhead, approximately 3.3% for registration and 0.15% for voting, supporting system efficiency, auditability, and transparency. The experimental results confirm the system's scalability, efficiency, and resilience under concurrent loads. This approach offers a unified solution to address key challenges in voter authentication, data integrity, and quantum-resilient security for digital systems.
Authors:Byron Dowling, Jozef Probcin, Adam Czajka
Title: AutoSIGHT: Automatic Eye Tracking-based System for Immediate Grading of Human experTise
Abstract:
Can we teach machines to assess the expertise of humans solving visual tasks automatically based on eye tracking features? This paper proposes AutoSIGHT, Automatic System for Immediate Grading of Human experTise, that classifies expert and non-expert performers, and builds upon an ensemble of features extracted from eye tracking data while the performers were solving a visual task. Results on the task of iris Presentation Attack Detection (PAD) used for this study show that with a small evaluation window of just 5 seconds, AutoSIGHT achieves an average average Area Under the ROC curve performance of 0.751 in subject-disjoint train-test regime, indicating that such detection is viable. Furthermore, when a larger evaluation window of up to 30 seconds is available, the Area Under the ROC curve (AUROC) increases to 0.8306, indicating the model is effectively leveraging more information at a cost of slightly delayed decisions. This work opens new areas of research on how to incorporate the automatic weighing of human and machine expertise into human-AI pairing setups, which need to react dynamically to nonstationary expertise distribution between the human and AI players (e.g. when the experts need to be replaced, or the task at hand changes rapidly). Along with this paper, we offer the eye tracking data used in this study collected from 6 experts and 53 non-experts solving iris PAD visual task.
Authors:Pedro H. Lui, Lucas P. Siqueira, Juliano F. Kazienko, Vagner E. Quincozes, Silvio E. Quincozes, Daniel Welfer
Title: On the Performance of Cyber-Biomedical Features for Intrusion Detection in Healthcare 5.0
Abstract:
Healthcare 5.0 integrates Artificial Intelligence (AI), the Internet of Things (IoT), real-time monitoring, and human-centered design toward personalized medicine and predictive diagnostics. However, the increasing reliance on interconnected medical technologies exposes them to cyber threats. Meanwhile, current AI-driven cybersecurity models often neglect biomedical data, limiting their effectiveness and interpretability. This study addresses this gap by applying eXplainable AI (XAI) to a Healthcare 5.0 dataset that integrates network traffic and biomedical sensor data. Classification outputs indicate that XGBoost achieved 99% F1-score for benign and data alteration, and 81% for spoofing. Explainability findings reveal that network data play a dominant role in intrusion detection whereas biomedical features contributed to spoofing detection, with temperature reaching a Shapley values magnitude of 0.37.
Authors:Anudeep Vurity, Emanuela Marasco, Raghavendra Ramachandra, Jongwoo Park
Title: ColFigPhotoAttnNet: Reliable Finger Photo Presentation Attack Detection Leveraging Window-Attention on Color Spaces
Abstract:
Finger photo Presentation Attack Detection (PAD) can significantly strengthen smartphone device security. However, these algorithms are trained to detect certain types of attacks. Furthermore, they are designed to operate on images acquired by specific capture devices, leading to poor generalization and a lack of robustness in handling the evolving nature of mobile hardware. The proposed investigation is the first to systematically analyze the performance degradation of existing deep learning PAD systems, convolutional and transformers, in cross-capture device settings. In this paper, we introduce the ColFigPhotoAttnNet architecture designed based on window attention on color channels, followed by the nested residual network as the predictor to achieve a reliable PAD. Extensive experiments using various capture devices, including iPhone13 Pro, GooglePixel 3, Nokia C5, and OnePlusOne, were carried out to evaluate the performance of proposed and existing methods on three publicly available databases. The findings underscore the effectiveness of our approach.
Authors:Zicheng Shen, Feng Zhao, Yibo Ni, Yuanmu Yang
Title: Extended monocular 3D imaging
Abstract:
3D vision is of paramount importance for numerous applications ranging from machine intelligence to precision metrology. Despite much recent progress, the majority of 3D imaging hardware remains bulky and complicated and provides much lower image resolution compared to their 2D counterparts. Moreover, there are many well-known scenarios that existing 3D imaging solutions frequently fail. Here, we introduce an extended monocular 3D imaging (EM3D) framework that fully exploits the vectorial wave nature of light. Via the multi-stage fusion of diffraction- and polarization-based depth cues, using a compact monocular camera equipped with a diffractive-refractive hybrid lens, we experimentally demonstrate the snapshot acquisition of a million-pixel and accurate 3D point cloud for extended scenes that are traditionally challenging, including those with low texture, being highly reflective, or nearly transparent, without a data prior. Furthermore, we discover that the combination of depth and polarization information can unlock unique new opportunities in material identification, which may further expand machine intelligence for applications like target recognition and face anti-spoofing. The straightforward yet powerful architecture thus opens up a new path for a higher-dimensional machine vision in a minimal form factor, facilitating the deployment of monocular cameras for applications in much more diverse scenarios.
Authors:Ariel Larey, Eyal Rond, Omer Achrack
Title: A Multi-Modal Approach for Face Anti-Spoofing in Non-Calibrated Systems using Disparity Maps
Abstract:
Face recognition technologies are increasingly used in various applications, yet they are vulnerable to face spoofing attacks. These spoofing attacks often involve unique 3D structures, such as printed papers or mobile device screens. Although stereo-depth cameras can detect such attacks effectively, their high-cost limits their widespread adoption. Conversely, two-sensor systems without extrinsic calibration offer a cost-effective alternative but are unable to calculate depth using stereo techniques. In this work, we propose a method to overcome this challenge by leveraging facial attributes to derive disparity information and estimate relative depth for anti-spoofing purposes, using non-calibrated systems. We introduce a multi-modal anti-spoofing model, coined Disparity Model, that incorporates created disparity maps as a third modality alongside the two original sensor modalities. We demonstrate the effectiveness of the Disparity Model in countering various spoof attacks using a comprehensive dataset collected from the Intel RealSense ID Solution F455. Our method outperformed existing methods in the literature, achieving an Equal Error Rate (EER) of 1.71% and a False Negative Rate (FNR) of 2.77% at a False Positive Rate (FPR) of 1%. These errors are lower by 2.45% and 7.94% than the errors of the best comparison method, respectively. Additionally, we introduce a model ensemble that addresses 3D spoof attacks as well, achieving an EER of 2.04% and an FNR of 3.83% at an FPR of 1%. Overall, our work provides a state-of-the-art solution for the challenging task of anti-spoofing in non-calibrated systems that lack depth information.
Authors:Jason Anderson
Title: World's First Authenticated Satellite Pseudorange from Orbit
Abstract:
Cryptographic Ranging Authentication is here! We present initial results on the Pulsar authenticated ranging service broadcast from space with Pulsar-0 utilizing a recording taken at Xona headquarters in Burlingame, CA. No assumptions pertaining to the ownership or leakage of encryption keys are required. This work discusses the Pulsar watermark design and security analysis. We derive the Pulsar watermark's probabilities of missed detection and false alarm, and we discuss the required receiver processing needed to utilize the Pulsar watermark. We present validation results of the Pulsar watermark utilizing the transmissions from orbit. Lastly, we provide results that demonstrate the spoofing detection efficacy with a spoofing scenario that incorporates the authentic transmissions from orbit. Because we make no assumption about the leakage of symmetric encryption keys, this work provides mathematical justification of the watermark's security, and our July 2025 transmissions from orbit, we claim the world's first authenticated satellite pseudorange from orbit.