arXiv Papers on Anti-Spoofing

Abstract:
Face Anti‑Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision‑Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision‑only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre‑trained models, such as supervised CNNs, supervised ViTs, and self‑supervised ViTs, under severe cross‑domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self‑supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine‑grained spoofing cues. Combined with Face Anti‑Spoofing Data Augmentation (FAS‑Aug), Patch‑wise Data Augmentation (PDA) and Attention‑weighted Patch Loss (APL), our proposed vision‑only baseline achieves state‑of‑the‑art performance in the MICO protocol. This baseline outperforms existing methods under the data‑constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision‑only baseline for FAS, demonstrating that optimized self‑supervised vision transformers can serve as a backbone for both vision‑only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS‑VFMbenchmark‑CVPRW2026/ .

Abstract:
RuASD (Russian AntiSpoofing Dataset) is a dedicated, reproducible benchmark for Russian‑language speech anti‑spoofing designed to evaluate both in‑domain discrimination and robustness to deployment‑style distribution shifts. It combines a large spoof subset synthesized using 37 modern Russian‑capable TTS and voice‑cloning systems with a bona fide subset curated from multiple heterogeneous open Russian speech corpora, enabling systematic evaluation across diverse data sources. To emulate typical dissemination and channel effects in a controlled and reproducible manner, RuASD includes configurable simulations of platform and transmission distortions, including room reverberation, additive noise/music, and a range of speech‑codec transcodings implemented via a unified processing chain. We benchmark a diverse set of publicly available anti‑spoofing countermeasures spanning lightweight supervised architectures, graph‑attention models, SSL‑based detectors, and large‑scale pretrained systems, and report reference results on both clean and simulated conditions to characterize robustness under realistic perturbation pipelines. The dataset is publicly available on Hugging Face (https://huggingface.co/datasets/MTUCI/RuASD) and ModelScope (https://modelscope.cn/datasets/lab260/RuASD).
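One of the distortions mentioned above, additive noise, is commonly simulated by scaling a noise signal so that the mixture reaches a chosen signal-to-noise ratio. The sketch below illustrates that general idea only; the function name `mix_at_snr` and its parameters are hypothetical and are not taken from the RuASD processing chain.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the result has the requested SNR in dB.

    Illustrative assumption: both inputs are equal-length lists of float
    samples. This is not the RuASD implementation.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    # Choose the gain so that speech_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy example: mix a square-wave "speech" signal with noise at 10 dB SNR.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values give a noisier mixture, so sweeping this parameter is one simple way to make such a perturbation configurable.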

Abstract:
Existing speech anti‑spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real‑world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi‑API audio anti‑spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open‑source models, and online platforms. Furthermore, we propose Nes2Net‑LA, a local‑attention enhanced variant of Nes2Net that improves local context modeling and fine‑grained spoofing feature extraction. Based on this dataset, we also define the API tracing task, enabling fine‑grained attribution of spoofed audio to its generation source. Experiments show that Nes2Net‑LA achieves state‑of‑the‑art performance and offers superior robustness, particularly under diverse and unseen spoofing conditions. Code (https://github.com/XuepingZhang/MultiAPI‑Spoof) and dataset (https://xuepingzhang.github.io/MultiAPI‑Spoof‑Dataset/) have been released.

Abstract:
Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2‑based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by "The 6th Face Anti‑Spoofing Workshop: Unified Physical‑Digital Attacks Detection@ICCV2025" and the SiW dataset. The project page is available at: https://gsisaoki.github.io/FAS‑DINOv2‑ICCVW/ .

Abstract:
With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS‑VFM, a scalable self‑supervised pre‑training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS‑VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR‑P masking, which explicitly prompts the model to pursue meaningful intra‑region Consistency and challenging inter‑region Coherency. We present a reliable self‑distillation mechanism that seamlessly couples MIM with ID to establish underlying local‑to‑global Correspondence. After pre‑training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross‑dataset deepfake detection, cross‑domain face anti‑spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre‑trained FS‑VFM, we further propose FS‑Adapter, a lightweight plug‑and‑play bottleneck atop the frozen backbone with a novel real‑anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS‑VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self‑supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task‑specific methods, while FS‑Adapter offers an excellent efficiency‑performance trade‑off. The code and models are available on https://fsfm‑3c.github.io/fsvfm.html.

Abstract:
This paper presents the first study on the impact of audio watermarking on spoofing countermeasures. While anti‑spoofing systems are essential for securing speech‑based applications, the influence of widely used audio watermarking, originally designed for copyright protection, remains largely unexplored. We construct watermark‑augmented training and evaluation datasets, named the Watermark‑Spoofing dataset, by applying diverse handcrafted and neural watermarking methods to existing anti‑spoofing datasets. Experiments show that watermarking consistently degrades anti‑spoofing performance, with higher watermark density correlating with higher Equal Error Rates (EERs). To mitigate this, we propose the Knowledge‑Preserving Watermark Learning (KPWL) framework, enabling models to adapt to watermark‑induced shifts while preserving their original‑domain spoofing detection capability. These findings reveal audio watermarking as a previously overlooked domain shift and establish the first benchmark for developing watermark‑resilient anti‑spoofing systems. All related protocols are publicly available at https://github.com/Alphawarheads/Watermark_Spoofing.git

Abstract:
Component‑level audio Spoofing (Comp‑Spoof) targets a new form of audio manipulation where only specific components of a signal, such as speech or environmental sound, are forged or substituted while other components remain genuine. Existing anti‑spoofing datasets and methods treat an utterance or a segment as entirely bona fide or entirely spoofed, and thus cannot accurately detect component‑level spoofing. To address this, we construct a new dataset, CompSpoof, covering multiple combinations of bona fide and spoofed speech and environmental sound. We further propose a separation‑enhanced joint learning framework that separates the audio components and applies anti‑spoofing models to each one. Joint learning is employed, preserving information relevant for detection. Extensive experiments demonstrate that our method outperforms the baseline, highlighting the necessity of separating components and the importance of detecting spoofing for each component separately. Datasets and code are available at: https://github.com/XuepingZhang/CompSpoof.

Abstract:
Recent face anti‑spoofing (FAS) methods have shown remarkable cross‑domain performance by employing vision‑language models like CLIP. However, existing CLIP‑based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP‑FAS, a novel framework incorporating two key modules: Multi‑View Slot attention (MVS) and Multi‑Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain‑specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP‑FAS achieves superior generalization performance, outperforming previous state‑of‑the‑art methods on cross‑domain datasets. Code: https://github.com/Elune001/MVP‑FAS.

Abstract:
Modern face recognition systems remain vulnerable to spoofing attempts, including both physical presentation attacks and digital forgeries. Traditionally, these two attack vectors have been handled by separate models, each targeting its own artifacts and modalities. However, maintaining distinct detectors increases system complexity and inference latency and leaves systems exposed to combined attack vectors. We propose the Paired‑Sampling Contrastive Framework, a unified training approach that leverages automatically matched pairs of genuine and attack selfies to learn modality‑agnostic liveness cues. Evaluated on the 6th Face Anti‑Spoofing Challenge Unified Physical‑Digital Attack Detection benchmark, our method achieves an average classification error rate (ACER) of 2.10 percent, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical for real‑world deployment. Code and pretrained models are available at https://github.com/xPONYx/iccv2025_deepfake_challenge.
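For readers unfamiliar with the ACER metric quoted above, it is commonly computed as the mean of the attack presentation classification error rate (APCER) and the bona fide presentation classification error rate (BPCER). The sketch below uses a simplified pooled form in which all attack types are grouped together; ISO/IEC 30107-3 defines APCER per attack instrument species, and the exact challenge protocol is not given in the abstract, so this is an illustration and the function name `acer` is hypothetical.

```python
def acer(labels, predictions):
    """Return (APCER, BPCER, ACER) for binary decisions.

    Convention assumed here: 1 = attack, 0 = bona fide, for both the
    ground-truth labels and the predicted decisions.
    """
    attack_preds = [p for y, p in zip(labels, predictions) if y == 1]
    bona_preds = [p for y, p in zip(labels, predictions) if y == 0]
    if not attack_preds or not bona_preds:
        raise ValueError("need at least one attack and one bona fide sample")
    # APCER: fraction of attacks wrongly accepted as bona fide.
    apcer = sum(1 for p in attack_preds if p == 0) / len(attack_preds)
    # BPCER: fraction of bona fide samples wrongly rejected as attacks.
    bpcer = sum(1 for p in bona_preds if p == 1) / len(bona_preds)
    return apcer, bpcer, (apcer + bpcer) / 2


# Toy example: 4 attacks (1 missed) and 4 bona fide samples (2 rejected).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 1, 1]
apcer_value, bpcer_value, acer_value = acer(labels, predictions)
```

Because ACER averages the two error rates, it is not dominated by whichever class has more samples in the evaluation set.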

Abstract:
Face anti‑spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross‑domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision‑language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta‑domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction‑tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content‑based instructions focus on the essential semantics of spoofing, and style‑based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at https://kunkunlin1221.github.io/InstructFLIP.

Abstract:
Advances in voice conversion and text‑to‑speech synthesis have made automatic speaker verification (ASV) systems more susceptible to spoofing attacks. This work explores modest refinements to the AASIST anti‑spoofing architecture. It incorporates a frozen Wav2Vec 2.0 encoder to retain self‑supervised speech representations in limited‑data settings, substitutes the original graph attention block with a standardized multi‑head attention module using heterogeneous query projections, and replaces heuristic frame‑segment fusion with a trainable, context‑aware integration layer. When evaluated on the ASVspoof 5 corpus, the proposed system reaches a 7.6% equal error rate (EER), improving on a re‑implemented AASIST baseline under the same training conditions. Ablation experiments suggest that each architectural change contributes to the overall performance, indicating that targeted adjustments to established models may help strengthen speech deepfake detection in practical scenarios. The code is publicly available at https://github.com/KORALLLL/AASIST_SCALING.
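The equal error rate reported above is the operating point at which the false acceptance rate (spoofed speech accepted) equals the false rejection rate (bona fide speech rejected). The sketch below shows one simple way to estimate it by sweeping thresholds over the observed scores. It assumes higher scores mean "more likely bona fide"; the function name `equal_error_rate` is hypothetical, and evaluation toolkits typically interpolate between thresholds rather than picking the closest one.

```python
def equal_error_rate(bona_scores, spoof_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A sample is accepted as bona fide when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(bona_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        far = sum(1 for s in spoof_scores if s >= t) / len(spoof_scores)
        frr = sum(1 for s in bona_scores if s < t) / len(bona_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy example: one bona fide score overlaps the spoof score range.
bona = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer = equal_error_rate(bona, spoof)
```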

Abstract:
Domain Generalizable Face Anti‑Spoofing (DGFAS) methods effectively capture domain‑invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DGFAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group‑wise Scaling Risk Minimization (GS‑RM). Specifically, GS‑RM facilitates bias alignment by balancing group‑wise losses across multiple domains. FOD employs the Gram‑Schmidt orthogonalization process to decompose the feature space explicitly into domain‑invariant and domain‑specific subspaces. By enforcing orthogonality between domain‑specific and domain‑invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state‑of‑the‑art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
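Two ingredients named above, Gram-Schmidt orthogonalization and Expected Calibration Error, can be illustrated in a few lines. The sketch below is not the authors' FOD or their evaluation code: it only shows how a feature vector can be split into a part lying in a given subspace and an orthogonal remainder, and how ECE is computed from equal-width confidence bins. All function names are hypothetical, and the half-open binning convention is an assumption.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, eps=1e-12):
    """Orthonormal basis for the span of `vectors` (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors that are linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def split_feature(feature, subspace_basis):
    """Return (orthogonal remainder, projection onto the subspace)."""
    inside = [0.0] * len(feature)
    for b in subspace_basis:
        coeff = dot(feature, b)
        inside = [s + coeff * bi for s, bi in zip(inside, b)]
    remainder = [f - s for f, s in zip(feature, inside)]
    return remainder, inside


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean |confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # bins are [lo, hi)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Toy example: treat the span of two vectors as a "domain-specific" subspace.
basis = gram_schmidt([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
invariant_part, specific_part = split_feature([2.0, 3.0, 4.0], basis)
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [True, True, True, False])
```

In this toy setup the two returned parts are orthogonal by construction, which is the property an orthogonality constraint between feature subspaces aims to enforce.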

Abstract:
Biometric authentication systems are increasingly being deployed in critical applications, but they remain susceptible to spoofing. Since most of the research efforts focus on modality‑specific anti‑spoofing techniques, building a unified, resource‑efficient solution across multiple biometric modalities remains a challenge. To address this, we propose LitMAS, a Light weight and generalizable Multi‑modal Anti‑Spoofing framework designed to detect spoofing attacks in speech, face, iris, and fingerprint‑based biometric systems. At the core of LitMAS is a Modality‑Aligned Concentration Loss, which enhances inter‑class separability while preserving cross‑modal consistency and enabling robust spoof detection across diverse biometric traits. With just 6M parameters, LitMAS surpasses state‑of‑the‑art methods by 1.36% in average EER across seven datasets, demonstrating high efficiency, strong generalizability, and suitability for edge deployment. Code and trained models are available at https://github.com/IAB‑IITJ/LitMAS.

Abstract:
Face recognition systems are designed to be robust against changes in head pose, illumination, and blurring during image capture. If a malicious person presents a face photo of the registered user, they may bypass the authentication process illegally. Such spoofing attacks need to be detected before face recognition. In this paper, we propose a spoofing attack detection method based on Vision Transformer (ViT) to detect minute differences between live and spoofed face images. The proposed method utilizes the intermediate features of ViT, which have a good balance between local and global features that are important for spoofing attack detection, for calculating loss in training and score in inference. The proposed method also introduces two data augmentation methods: face anti‑spoofing data augmentation and patch‑wise data augmentation, to improve the accuracy of spoofing attack detection. We demonstrate the effectiveness of the proposed method through experiments using the OULU‑NPU and SiW datasets. The project page is available at: https://gsisaoki.github.io/FAS‑ViT‑CVPRW/ .

Abstract:
Traditional anti‑spoofing focuses on models and datasets built on synthetic speech in a mostly neutral state, neglecting diverse emotional variations. As a result, their robustness against high‑quality, emotionally expressive synthetic speech is uncertain. We address this by introducing EmoSpoof‑TTS, a corpus of emotional text‑to‑speech samples. Our analysis shows that existing anti‑spoofing models struggle with emotional synthetic speech, exposing risks of emotion‑targeted attacks. Even when trained on emotional data, the models underperform due to their limited focus on emotional aspects and show performance disparities across emotions. This highlights the need for an emotion‑focused anti‑spoofing paradigm in both dataset and methodology. We propose GEM, a gated ensemble of emotion‑specialized models with a speech emotion recognition gating network. GEM performs effectively across all emotions and the neutral state, improving defenses against spoofing attacks. We release the EmoSpoof‑TTS Dataset: https://emospoof‑tts.github.io/Dataset/

Abstract:
Face anti‑spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision‑making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. To address this gap, we propose FaceShield, an MLLM for FAS, along with the corresponding pre‑training and supervised fine‑tuning (SFT) datasets, FaceShield‑pre10K and FaceShield‑sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof‑aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use a prompt‑guided vision token masking (PVTM) strategy to randomly mask vision tokens, thereby improving the model's generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse‑grained classification, fine‑grained classification, reasoning, and attack localization. Our instruction datasets, protocols, and codes will be released at https://github.com/Why0912/FaceShield.

Abstract:
Speech foundation models have significantly advanced various speech‑related tasks by providing exceptional representation capabilities. However, their high‑dimensional output features often create a mismatch with downstream task models, which typically require lower‑dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back‑end architecture designed to directly process high‑dimensional features without DR layers. The nested structure enhances multi‑scale feature extraction, improves feature interaction, and preserves high‑dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back‑end computational cost reduction over the state‑of‑the‑art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In‑the‑Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real‑world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre‑trained models are available at https://github.com/Liu‑Tianchi/Nes2Net.

Abstract:
Domain generalization (DG) aims to improve the generalizability of computer vision models toward distribution shifts. The mainstream DG methods focus on learning domain invariance; however, such methods overlook the potential inherent in domain‑specific information. While the prevailing practice of discriminative linear classifier has been tailored to domain‑invariant features, it struggles when confronted with diverse domain‑specific information, e.g., intra‑class shifts, that exhibits multi‑modality. To address these issues, we explore the theoretical implications of relying on domain invariance, revealing the crucial role of domain‑specific information in mitigating the target risk for DG. Drawing from these insights, we propose Generative Classifier‑driven Domain Generalization (GCDG), introducing a generative paradigm for the DG classifier based on Gaussian Mixture Models (GMMs) for each class across domains. GCDG consists of three key modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Concretely, HLC attempts to model the feature distributions and thereby capture valuable domain‑specific information via GMMs. SCB identifies the neural units containing spurious correlations and perturbs them, mitigating the risk of HLC learning spurious patterns. Meanwhile, DCB ensures a balanced contribution of components in HLC, preventing the underestimation or neglect of critical components. In this way, GCDG excels in capturing the nuances of domain‑specific information characterized by diverse distributions. GCDG demonstrates the potential to reduce the target risk and encourage flat minima, improving the generalizability. Extensive experiments show GCDG's comparable performance on five DG benchmarks and one face anti‑spoofing dataset, seamlessly integrating into existing DG methods with consistent improvements.

Abstract:
With the availability of diverse sensor modalities (i.e., RGB, Depth, Infrared) and the success of multi‑modal learning, multi‑modal face anti‑spoofing (FAS) has emerged as a prominent research focus. The intuition behind it is that leveraging multiple modalities can uncover more intrinsic spoofing traces. However, this approach presents more risk of misalignment. We identify two main types of misalignment: (1) Intra‑domain modality misalignment, where the importance of each modality varies across different attacks. For instance, certain modalities (e.g., Depth) may be non‑defensive against specific attacks (e.g., 3D mask), indicating that each modality has unique strengths and weaknesses in countering particular attacks. Consequently, simple fusion strategies may fall short. (2) Inter‑domain modality misalignment, where the introduction of additional modalities exacerbates domain shifts, potentially overshadowing the benefits of complementary fusion. To tackle (1), we propose an alignment module between modalities based on mutual information, which adaptively enhances favorable modalities while suppressing unfavorable ones. To address (2), we employ a dual alignment optimization method that aligns both sub‑domain hyperplanes and modality angle margins, thereby mitigating domain gaps. Our method, dubbed Dual Alignment of Domain and Modality (DADM), achieves state‑of‑the‑art performance in extensive experiments across four challenging protocols, demonstrating its robustness in multi‑modal domain generalization scenarios. The codes will be released soon.

Abstract:
Although face recognition systems have seen a massive performance enhancement in recent years, they are still targeted by threats such as presentation attacks, leading to the need for generalizable presentation attack detection (PAD) algorithms. Current PAD solutions suffer from two main problems: low generalization to unknown scenarios and large training data requirements. Foundation models (FM) are pre‑trained on extensive datasets, achieving remarkable results when generalizing to unseen domains and allowing for efficient task‑specific adaptation even when little training data are available. In this work, we recognize the potential of FMs to address common PAD problems and tackle the PAD task with an adapted FM for the first time. The FM under consideration is adapted with LoRA weights while simultaneously training a classification header. The resultant architecture, FoundPAD, is highly generalizable to unseen domains, achieving competitive results in several settings under different data availability scenarios and even when using synthetic training data. To encourage reproducibility and facilitate further research in PAD, we publicly release the implementation of FoundPAD at https://github.com/gurayozgur/FoundPAD .
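The LoRA adaptation mentioned above keeps the pre-trained weight matrix frozen and learns a low-rank update that is added to its output. The sketch below illustrates the general LoRA forward pass for a single linear layer in plain Python; it is not FoundPAD's code, and the function names, the row-vector convention, and the toy dimensions are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_forward(x, frozen_w, lora_a, lora_b, alpha):
    """Compute x @ W + (alpha / r) * x @ A @ B.

    Shapes (row-vector convention, assumed here): x is (n, d_in), the frozen
    W is (d_in, d_out), A is (d_in, r), and B is (r, d_out). Only A and B
    would be trained, i.e. r * (d_in + d_out) parameters instead of
    d_in * d_out.
    """
    rank = len(lora_b)
    base = matmul(x, frozen_w)
    update = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [
        [b + scale * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]


# Toy example with d_in = d_out = 2 and rank r = 1.
x = [[1.0, 2.0]]
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0], [1.0]]
lora_b_init = [[0.0, 0.0]]  # B starts at zero, so the layer is initially unchanged
unchanged = lora_forward(x, frozen_w, lora_a, lora_b_init, alpha=8.0)
adapted = lora_forward(x, frozen_w, lora_a, [[1.0, -1.0]], alpha=8.0)
```

Starting one of the two low-rank factors at zero means training begins from the behaviour of the pre-trained model, which is part of why this kind of adaptation can work with little task-specific data.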

Abstract:
In the domain of facial recognition security, multimodal Face Anti‑Spoofing (FAS) is essential for countering presentation attacks. However, existing technologies encounter challenges due to modality biases and imbalances, as well as domain shifts. Our research introduces a Mixture of Experts (MoE) model to address these issues effectively. We identified three limitations in traditional MoE approaches to multimodal FAS: (1) Coarse‑grained experts' inability to capture nuanced spoofing indicators; (2) Gated networks' susceptibility to input noise affecting decision‑making; (3) MoE's sensitivity to prompt tokens leading to overfitting with conventional learning methods. To mitigate these, we propose the Bypass Isolated Gating MoE (BIG‑MoE) framework, featuring: (1) Fine‑grained experts for enhanced detection of subtle spoofing cues; (2) An isolation gating mechanism to counteract input noise; (3) A novel differential convolutional prompt bypass enriching the gating network with critical local features, thereby improving perceptual capabilities. Extensive experiments on four benchmark datasets demonstrate significant generalization performance improvement on the multimodal FAS task. The code is released at https://github.com/murInJ/BIG‑MoE.

Abstract:
Iris‑based biometric systems are vulnerable to presentation attacks (PAs), where adversaries present physical artifacts (e.g., printed iris images, textured contact lenses) to defeat the system. This has led to the development of various presentation attack detection (PAD) algorithms, which typically perform well in intra‑domain settings. However, they often struggle to generalize effectively in cross‑domain scenarios, where training and testing employ different sensors, PA instruments, and datasets. In this work, we use adversarial training samples of both bonafide irides and PAs to improve the cross‑domain performance of a PAD classifier. The novelty of our approach lies in leveraging transformation parameters from classical data augmentation schemes (e.g., translation, rotation) to generate adversarial samples. We achieve this through a convolutional autoencoder, ADV‑GEN, that inputs original training samples along with a set of geometric and photometric transformations. The transformation parameters act as regularization variables, guiding ADV‑GEN to generate adversarial samples in a constrained search space. Experiments conducted on the LivDet‑Iris 2017 database, comprising four datasets, and the LivDet‑Iris 2020 dataset, demonstrate the efficacy of our proposed method. The code is available at https://github.com/iPRoBe‑lab/ADV‑GEN‑IrisPAD.

Abstract:
LLM watermarks stand out as a promising way to attribute ownership of LLM‑generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state‑of‑the‑art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post‑hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning‑based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning‑based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat. We make all our code available at https://github.com/eth‑sri/watermark‑spoofing‑detection .

Abstract:
Face Forgery Detection (FFD), or Deepfake detection, aims to determine whether a digital face is real or fake. Due to different face synthesis algorithms with diverse forgery patterns, FFD models often overfit specific patterns in training datasets, resulting in poor generalization to other unseen forgeries. Existing FFD methods primarily leverage pre‑trained backbones with general image representation capabilities and fine‑tune them to identify facial forgery cues. However, these backbones lack domain‑specific facial knowledge and insufficiently capture complex facial features, thus hindering effective implicit forgery cue identification and limiting generalization. Therefore, it is essential to revisit FFD workflow across the pre‑training and fine‑tuning stages, achieving an elaborate integration from facial representation to forgery detection to improve generalization. Specifically, we develop an FFD‑specific pre‑trained backbone with superior facial representation capabilities through self‑supervised pre‑training on real faces. We then propose a competitive fine‑tuning framework that stimulates the backbone to identify implicit forgery cues through a competitive learning mechanism. Moreover, we devise a threshold optimization mechanism that utilizes prediction confidence to improve the inference reliability. Comprehensive experiments demonstrate that our method achieves excellent performance in FFD and extra face‑related tasks, i.e., presentation attack detection. Code and models are available at https://github.com/zhenglab/FFDBackbone.

Abstract:
Face anti‑spoofing (FAS) plays a vital role in protecting face recognition (FR) systems from presentation attacks. Nowadays, FAS systems face the challenge of domain shift, impacting the generalization performance of existing FAS methods. In this paper, we rethink the inherent nature of domain shift and deconstruct it into two factors: image style and image quality. Quality influences the purity of the presentation of spoof information, while style affects the manner in which spoof information is presented. Based on our analysis, we propose the DiffFAS framework, which quantifies quality as prior information input into the network to counter image quality shift, and performs diffusion‑based high‑fidelity generation across domains and attack types to counter image style shift. DiffFAS transforms easily collectible live faces into high‑fidelity attack faces with precise labels while maintaining consistency between live and spoof face identities, which can also alleviate the scarcity of labeled data for novel attack types faced by today's FAS systems. We demonstrate the effectiveness of our framework on challenging cross‑domain and cross‑attack FAS datasets, achieving state‑of‑the‑art performance. Available at https://github.com/murphytju/DiffFAS.

Abstract:
Face Anti‑Spoofing (FAS) research is challenged by the cross‑domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model‑centric, focusing on developing domain generalization algorithms for improving cross‑domain performance, data‑centric research for face anti‑spoofing, improving generalization from data quality and quantity, is largely ignored. Therefore, our work starts with data‑centric FAS by conducting a comprehensive investigation from the data perspective for improving cross‑domain generalization of FAS models. More specifically, at first, based on physical procedures of capturing and recapturing, we propose task‑specific FAS data augmentation (FAS‑Aug), which increases data diversity by synthesizing data of artifacts, such as printing noise, color distortion, moiré pattern, etc. Our experiments show that using our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross‑domain performance. Nevertheless, we observe that models may rely on the augmented artifacts, which are not environment‑invariant, and using FAS‑Aug may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and improve the generalization performance. Last but not least, our proposed FAS‑Aug and SARE with recent Vision Transformer backbones can achieve state‑of‑the‑art performance on the FAS cross‑domain generalization protocols. The implementation is available at https://github.com/RizhaoCai/FAS_Aug.

Abstract:
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD) track. The rapid advancement of generative AI models presents significant challenges for detecting AI‑generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore ensemble methods, utilizing speech foundation models to develop robust singing voice anti‑spoofing systems. We also introduce a novel Squeeze‑and‑Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The code can be accessed at https://github.com/Anmol2059/SVDD2024.
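
As an illustration of the pooled EER metric quoted above, the following is a minimal sketch and not the challenge's official scoring tool. It assumes that higher scores mean "more likely bona fide", pools all trials into two lists, and approximates the EER at the observed threshold where the false acceptance and false rejection rates are closest. The function name `equal_error_rate` and the toy scores are hypothetical.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumes higher score = more bona fide. Returns (eer, threshold).
    """
    best = None
    for thr in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoof trials scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy pooled scores (hypothetical values, not challenge data).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, thr = equal_error_rate(bonafide, spoof)
print(f"EER = {100 * eer:.2f}% at threshold {thr}")
```

Production scoring scripts usually interpolate between thresholds; the sweep above trades that precision for readability.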

Abstract:
Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA‑SURF Cross‑Ethnicity Face Anti‑Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT‑Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT‑S fine‑tuned on CeFA across African, East Asian, and zero‑shot Central Asian demographic groups. DeiT‑S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT‑S reduces the inter‑ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP‑based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero‑shot Central Asian subjects, DeiT‑S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross‑demographic fairness in PAD may partly be influenced by architectural design.
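
The relative figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the abstract; the helper names `acer` and `relative_reduction` are hypothetical, and ACER is taken in its usual sense as the mean of APCER and BPCER.

```python
def acer(apcer, bpcer):
    """Average Classification Error Rate: mean of APCER and BPCER (both in %)."""
    return (apcer + bpcer) / 2


def relative_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100 * (baseline - new) / baseline


# Inter-ethnic ACER gap: 0.75% (LBP-based work) versus 0.13% (DeiT-S).
gap_reduction = relative_reduction(0.75, 0.13)
# BPCER on zero-shot Central Asian subjects: ResNet18 versus DeiT-S.
bpcer_ratio = 10.44 / 2.89

print(f"gap reduction: {gap_reduction:.1f}%")
print(f"BPCER ratio: {bpcer_ratio:.2f}x")
```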

Abstract:
Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti‑spoofing systems is their limited robustness to unseen synthesis methods. In this work, we transform a self‑supervised speech representation model into a Mixture‑of‑Experts (MoE) architecture to improve generalization. Feed‑forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer‑wise gating mechanism, allowing experts to capture complementary acoustic patterns while preserving the representations learned during self‑supervised pretraining. We further analyze the architectural choices affecting the performance of this MoE conversion and investigate the activation behavior of the experts. The proposed approach is evaluated on 14 spoofing datasets and reduces the macro EER from 5.46% to 4.81%, corresponding to 11.9% relative improvement over the baseline.
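
To make the idea of replacing a feed‑forward block with gated experts concrete, here is a minimal sketch of a dense (soft) mixture for a single frame vector. It is an assumption‑laden toy: the abstract only states that a layer‑wise gating mechanism controls multiple experts, so the softmax gate, the dot‑product gate logits, and all names below are hypothetical, and the actual system may use sparse routing instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_feed_forward(x, experts, gate_weights):
    """Mix expert outputs for one frame vector `x`.

    experts: list of callables, each mapping a vector to a same-size vector.
    gate_weights: one weight vector per expert; gate logit = dot(weights, x).
    """
    logits = [sum(w * v for w, v in zip(ws, x)) for ws in gate_weights]
    probs = softmax(logits)
    outputs = [expert(x) for expert in experts]
    mixed = [sum(p * out[i] for p, out in zip(probs, outputs))
             for i in range(len(x))]
    return mixed, probs


# Two toy "experts" standing in for feed-forward networks.
experts = [lambda v: [2 * a for a in v], lambda v: [-a for a in v]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
y, gate = moe_feed_forward([1.0, 1.0], experts, gate_weights)
```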

Abstract:
Saliency‑guided training is a paradigm in visual recognition that encourages models to focus on the most relevant image regions during learning. While its application in biometric presentation attack detection (PAD) has shown strong benefits in robustness and generalization, adoption is often limited by the high cost, domain specificity, and limited scalability of existing saliency acquisition methods, such as human annotations over a limited dataset. We present a novel, cost‑efficient, and highly‑scalable approach to saliency acquisition using maps inspired by classical dimensionality reduction techniques: PCA and LDA. Our proposed methods generate saliency maps directly from raw training data, requiring neither human annotation nor domain knowledge. We contextualize the effectiveness of these saliency sources in three saliency‑explored domains (iris PAD, synthetic face detection, fingerprint PAD) and demonstrate their scalability in two saliency‑novel domains (fingerprint vein PAD and ID card PAD). Across all domains tested, models trained using dimensionality reduction‑sourced saliency maps exceed baseline and sometimes SOTA saliency methods without any resource investment or domain‑specific tooling. Our findings overcome an important yet unaddressed barrier to saliency‑guided training for biometric attack detection and beyond.

Abstract:
This paper addresses the challenge of assessing image quality in ID cards in remote verification systems by applying capture‑related quality measures from the Open Face Image Quality (OFIQ) standard to ID card images. Our preprocessing pipeline includes corner detection, perspective normalization, and comprehensive foreground masking to ensure accurate and unbiased quality measure computation. We evaluate the effectiveness of these measures by analyzing their correlation with the performance of three presentation attack detection (PAD) algorithms across four diverse ID card datasets, where two datasets contain bona fide, i.e. pristine, images and two contain printed mock ID cards. Our results suggest that quality assessment based on some OFIQ measures can significantly improve PAD performance.

Abstract:
We present SpAArSIST, a deployment‑oriented refinement of the widely used AASIST graph pooling backend for self‑supervised learning (SSL) based anti‑spoofing. Motivated by redundant operations in public implementations, we replace learned pooling and stack‑node attention with explicit, lightweight choices: separate train and inference graph pooling ratios (k_tr, k_inf), magnitude‑based node scoring, and mean aggregation of graph nodes. The best overall configuration (rank 1) cuts backend compute by 20.7% (195.045M → 154.706M MACs) and model size by 4.1% (611.8k → 586.4k params), while improving out‑of‑domain robustness on In‑the‑Wild to 2.82% EER and 0.078 minDCF (from 4.64% and 0.133) and remaining competitive on ASVspoof5. We further provide a composite selection score that summarizes accuracy, calibration, and compute to support balanced deployment‑oriented model choice.
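
The three lightweight choices named above (separate pooling ratios, magnitude‑based node scoring, mean aggregation) can be sketched as follows. This is one plausible reading and not the authors' implementation: the L2 norm as the magnitude score, the ceiling rule for the number of kept nodes, and the ratio values are all assumptions made for illustration.

```python
import math


def pool_graph(nodes, ratio):
    """Keep the top `ratio` fraction of nodes by L2 magnitude, then average them.

    nodes: list of equal-length feature vectors. Returns one pooled vector.
    """
    keep = max(1, math.ceil(ratio * len(nodes)))
    # Magnitude-based scoring: no learned projection, only the vector norm.
    ranked = sorted(nodes, key=lambda n: math.sqrt(sum(v * v for v in n)),
                    reverse=True)
    kept = ranked[:keep]
    dim = len(nodes[0])
    # Mean aggregation over the surviving nodes.
    return [sum(n[i] for n in kept) / keep for i in range(dim)]


K_TRAIN, K_INFER = 0.5, 0.75  # hypothetical ratios, not values from the paper
nodes = [[3.0, 4.0], [0.0, 1.0], [2.0, 0.0], [6.0, 8.0]]
train_vec = pool_graph(nodes, K_TRAIN)   # averages the 2 largest-norm nodes
infer_vec = pool_graph(nodes, K_INFER)   # averages the 3 largest-norm nodes
```

Because the scoring has no trainable parameters, the ratio can be changed between training and inference without retraining, which is what makes separate ratios possible in a sketch like this one.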

Abstract:
Biometric systems are increasingly deployed in security applications; however, they remain vulnerable to spoofing attacks, in which attackers exploit counterfeit biometric data to gain unauthorized access. This research evaluates the effectiveness of state‑of‑the‑art machine learning models, MobileNetV2, DenseNet‑121, Inception‑v3, and Spoof Trace Disentanglement (STD) in detecting spoofing attacks within facial recognition systems. Using the CelebA‑Spoof dataset, the study evaluates model effectiveness using metrics such as accuracy, precision, recall, and F1 Score. Cross‑dataset validation is carried out on the MSU‑MFSD dataset to assess generalizability. The results show MobileNetV2 as the most efficient model, achieving 92% accuracy while balancing computational effectiveness, making it appropriate for real‑life applications. Inception‑v3 shows moderate robustness, while DenseNet‑121 and STD struggle with generalization. The findings highlight the need for advances in domain adaptation and hybrid architectures to enhance biometric security systems.

Abstract:
We introduce a spoofing countermeasure architecture conditioned on speaker‑reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference‑Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single‑utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state‑of‑the‑art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.

Abstract:
The Environment‑Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five component‑level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. After the challenge concludes, we analyze the final leaderboard and summarize effective design choices from the top‑performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro‑F1 score of 0.8775, substantially outperforming the separation‑enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross‑domain self‑supervised encoders, targeted data augmentation, and selective ensembling rather than simple model scaling. At the same time, auxiliary EER analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to unseen generators in the test set. This paper reports challenge results and provides insights for future environment‑aware deepfake detection research. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.
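
For reference, Macro‑F1 is the unweighted mean of per‑class F1 scores, so rare classes count as much as frequent ones. The sketch below is a minimal implementation of that definition; the three class labels are hypothetical placeholders, not the challenge's actual five classes.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for illustration only.
y_true = ["bonafide", "bonafide", "spoof_speech", "spoof_env"]
y_pred = ["bonafide", "spoof_speech", "spoof_speech", "spoof_env"]
score = macro_f1(y_true, y_pred)
```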

Abstract:
Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in‑domain conditions, generalisation to out‑of‑domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher‑student framework for speaker‑invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre‑trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction in EER compared to the MHFA baseline.
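
A gradient reversal layer is simple enough to sketch without a deep learning framework: it passes features through unchanged in the forward pass and flips the sign of the gradient in the backward pass, so the encoder is pushed to make the speaker branch worse rather than better. The class below is a toy illustration only; the scaling factor `lam` and all names are hypothetical, and the teacher model and Variational Information Bottleneck from the abstract are omitted.

```python
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam going back."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features reach the speaker-related branch unchanged.
        return list(features)

    def backward(self, grad_from_speaker_branch):
        # The encoder receives the negated gradient, which discourages
        # features that help the speaker-related objective.
        return [-self.lam * g for g in grad_from_speaker_branch]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
passed = grl.forward(features)
reversed_grad = grl.backward([1.0, -2.0, 4.0])
```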

Abstract:
Voice biometric systems face growing threats from spoofing attacks, yet the evaluation of detection models remains inconsistent across datasets. To investigate these unpredictable fluctuations, we conduct a comprehensive benchmark of four self‑supervised learning feature extractors paired with four back‑end classifiers. We compare the hierarchical local feature extraction of ResNet with the global sequence and relational modeling of attention and graph‑based back‑ends. Through multi‑corpus training across three scenarios and six evaluation datasets, our empirical analysis yields two critical findings. First, we expose a domain bias within the ASVspoof 5 dataset, showing that naive data scaling actively degrades performance. Second, our cross‑linguistic analysis reveals that fine‑tuning with just 8 hours of target‑language data enhances detection robustness. Together, these findings emphasize the critical need for domain‑aware and language‑specific adaptation in spoofing detection.

Abstract:
The scale of speech anti‑spoofing datasets has grown exponentially over the past decade, driven by the assumption that larger data leads to better performance. However, it remains unclear whether indiscriminate scaling commensurately improves model generalization. This study challenges the "scale‑first" paradigm by decoupling the impacts of training data scale versus diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible returns and may even degrade cross‑domain generalization due to overfitting. (2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger‑scale datasets with limited diversity in cross‑dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale to effectively enhance model generalization.

Abstract:
Cross‑domain shifts challenge Presentation Attack Detection (PAD) on ID Cards, given the restricted data available due to privacy concerns. This work proposes a compact multimodal model, based on new generative and discriminative blocks, which combines visual and textual data for PAD on genuine and synthetic ID images. While multimodal models exhibit strong generalisation after supervised fine‑tuning, they fail in zero‑shot settings. Our findings underscore that model capacity and real‑world data are essential for reliable PAD, while existing synthetic datasets may not reflect real‑world challenges. We argue for a re‑evaluation of synthetic data as a benchmark and emphasise the need for more realistic, diverse datasets to advance PAD research.

Abstract:
Synthetic and manipulated speech can reduce the reliability of automatic speaker verification systems, so anti‑spoofing methods need to be both accurate and efficient in training and inference. This paper focuses on the ASVspoof 5 Track 1 closed condition, where standard cross‑entropy training may not give enough attention to hard trials and is not directly aligned with ranking‑ and threshold‑based evaluation metrics. We propose TFPARN, a Transformer‑based focal‑pairwise attentive ranking network. The system extracts log‑Mel features from speech, uses a Transformer encoder to model frame‑level information, applies attention pooling to obtain utterance‑level representations, and is trained with a combination of focal classification loss and pairwise ranking loss. RawBoost augmentation is used during training, and test‑time augmentation is applied during evaluation to improve robustness. Compared with re‑implemented AASIST and RawNet2 baselines under the same protocol, TFPARN achieves the best results, with a minDCF of 0.2430 and an EER of 12.52%. Ablation experiments further show that the pairwise loss, focal loss, and attention pooling all improve performance. TFPARN also uses the lowest inference memory among the compared systems, at 1.4 GB, runs at about 0.79 ms per utterance, and reaches its best checkpoint in less training time than AASIST. These results show that TFPARN provides a good balance between detection accuracy and computational cost for logical access anti‑spoofing.
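
The combination of a focal classification loss and a pairwise ranking loss can be sketched as below. This is a hedged illustration, not the TFPARN implementation: the logistic form of the pairwise term, the mixing weight, and the value of `gamma` are assumptions, since the abstract does not give them.

```python
import math


def focal_loss(p_bonafide, label, gamma=2.0):
    """Binary focal loss for one trial; label is 1 (bona fide) or 0 (spoof)."""
    p_t = p_bonafide if label == 1 else 1.0 - p_bonafide
    # (1 - p_t)^gamma shrinks the loss of easy, confidently correct trials.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def pairwise_ranking_loss(bonafide_scores, spoof_scores):
    """Logistic loss over all (bona fide, spoof) pairs.

    Small when every bona fide score exceeds every spoof score.
    """
    pairs = [(b, s) for b in bonafide_scores for s in spoof_scores]
    return sum(math.log1p(math.exp(-(b - s))) for b, s in pairs) / len(pairs)


def combined_loss(probs, labels, scores, weight=0.5, gamma=2.0):
    """Mean focal loss plus a weighted pairwise term (weight is hypothetical)."""
    focal = sum(focal_loss(p, y, gamma) for p, y in zip(probs, labels)) / len(probs)
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    return focal + weight * pairwise_ranking_loss(bona, spoof)


total = combined_loss(probs=[0.9, 0.2], labels=[1, 0], scores=[2.0, -1.5])
```

The pairwise term depends only on score ordering, which is why it aligns more directly with ranking‑based metrics such as EER than cross‑entropy alone.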

Abstract:
Multi‑modal Domain Generalization (MMDG) seeks to leverage complementary modalities to enhance model robustness on unseen domains. Despite extensive progress in Multi‑modal Learning (MML) and Domain Generalization (DG) as individual fields, their systematic integration remains under‑explored. Current MMDG research is largely confined to action recognition and lacks standardized evaluation protocols. To address this, we introduce MMDG‑Bench, a comprehensive benchmark featuring two foundational frameworks: DG then MML (D2M) and MML then DG (M2D). We provide unified experimental protocols across diverse tasks, including video‑audio‑flow action recognition and RGB‑Depth‑IR face anti‑spoofing. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both D2M and M2D orderings, we demonstrate that these structured combinations frequently outperform existing state‑of‑the‑art methods, underscoring the necessity of a unified benchmarking effort. Our analysis yields three key insights: (1) Integrating DG techniques provides consistent generalization gains across various backbones, whereas non‑DG methods are highly sensitive to backbone shifts; (2) The optimal framework choice depends on inter‑modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross‑domain relational variance; (3) Stronger backbones yield amplified performance dividends when integrated into our structured frameworks. MMDG‑Bench provides a principled foundation and actionable design guidelines for future research in multi‑modal robustness. Code is released at https://github.com/qszhan/MMDG‑Bench.

Abstract:
Public datasets such as DLC‑2021, SynID, and KID34K have significantly contributed to research on presentation attack detection for identity documents, including screen replay attacks. However, evaluation of out‑of‑domain (OOD) robustness remains insufficiently explored, especially under realistic domain shifts. In this work, we introduce Receipt Replay OOD, a small out‑of‑domain benchmark for screen replay detection. Receipts share several characteristics with identity documents, including planar geometry, curved corners, wear‑and‑tear artifacts, and text or logo patterns, while avoiding personally identifiable information constraints commonly associated with identity documents. We evaluate document replay detection models under cross‑domain conditions and demonstrate the impact of domain shift on generalization performance. The dataset is publicly available.

Abstract:
Vision foundation models have demonstrated strong transferability across diverse visual recognition tasks and are increasingly considered for biometric applications. Their suitability for iris Presentation Attack Detection (PAD), particularly under realistic open‑set operating conditions, remains insufficiently examined. This work presents a systematic failure analysis of general‑purpose vision foundation models for open‑set iris PAD using periocular imagery. Five representative foundation models are evaluated under three open‑set protocols that explicitly separate different sources of distribution shift: unseen Presentation Attack Instruments (PAIs), unseen datasets captured with different sensors and cross‑spectral transfer from near‑infrared (NIR) to visible spectrum (VIS) imagery. Both frozen feature representations and parameter‑efficient task adaptation using Low‑Rank Adaptation (LoRA) are assessed within a unified experimental framework. The results indicate that foundation models can transfer across datasets with similar sensing characteristics, but fail to generalise reliably to unseen attack instruments and degrade sharply under cross‑spectral evaluation. While LoRA improves performance in certain cross‑dataset settings, it frequently amplifies failure under attack‑level and spectral shifts. Additional validation experiments using segmented iris inputs, full backbone fine‑tuning, joint cross‑dataset and cross‑PAI shifts, and reverse VIS to NIR transfer further confirm that these failures are not simply artefacts of periocular input, weak adaptation, or one‑directional spectral evaluation. These findings show that strong closed‑set or cross‑dataset performance should not be treated as evidence of robust open‑set security, and highlight the need for PAD representations that maintain sensitivity to presentation artefacts while remaining stable under realistic deployment variation.
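
For readers unfamiliar with Low‑Rank Adaptation, the sketch below shows the core idea in plain Python: the pretrained weight W stays frozen and only a low‑rank update B·A, scaled by alpha / rank, is trained. The toy matrices and helper names are hypothetical; the study's actual ranks, scaling, and adapted layers are not stated in the abstract.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w_frozen, lora_b, lora_a, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A; only A and B are trained."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (toy values)
a = [[0.5, -0.5]]              # rank-1 factor A (1x2)
b_init = [[0.0], [0.0]]        # B starts at zero, so the initial update is zero
b_trained = [[1.0], [2.0]]     # hypothetical values after some training

w_start = lora_weight(w, b_init, a, alpha=2, rank=1)
w_adapted = lora_weight(w, b_trained, a, alpha=2, rank=1)
```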

Abstract:
Face presentation attack detection (FacePAD) remains challenging under diverse spoofing representation, including 2D print and replay, 3D mask‑based spoofing, makeup‑induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real‑time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual‑branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel‑encoded optical flow, enabling effective modeling of micro‑motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion‑aware knowledge from the flow‑augmented teacher to a lightweight RGB‑only student via logit distillation. As a result, the student implicitly learns motion‑sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks, achieving 0.0% HTER on Replay‑Attack and Replay‑Mobile, 0.94% HTER on ROSE‑Youtu, 5.65% HTER on SiW‑Mv2, and 0.42% ACER on OULU‑NPU. The distilled student achieves performance comparable to or better than the teacher while significantly reducing parameters and FLOPs, achieving 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real‑time and resource‑constrained FacePAD deployment.
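
Logit distillation is commonly implemented as a KL divergence between temperature‑softened teacher and student distributions. The sketch below follows that common formulation under stated assumptions: the temperature value, the T² scaling, and the function names are illustrative, and the abstract does not specify the exact loss used.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return temperature ** 2 * kl


# Logits ordered as [live, spoof]; values are hypothetical.
teacher = [2.0, -1.0]            # flow-augmented teacher
student = [0.0, 0.0]             # untrained RGB-only student
loss_before = logit_distillation_loss(student, teacher)
loss_matched = logit_distillation_loss(teacher, teacher)
```

Only the student runs at inference, so the optical flow branch adds cost during training but none at deployment.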

Abstract:
Searching a multi‑biometric database of a billion records for a country‑level identity system requires pushing the limits of all aspects of a biometric system, including acquisition, preprocessing, feature extraction, accuracy, matching speed, presentation attack detection, and handling of special cases (e.g., missing finger digits). This is the first paper that gives insights into such a large‑scale multimodal biometric search system, called Bharat ABIS, based on open‑source architectures. The end‑to‑end pipeline of Bharat ABIS processes fingerprint, face and iris modalities through modality‑specific stages of preprocessing (segmentation), quality assessment, presentation attack detection, and learning an embedding (feature extraction), producing a concatenated template of 13.5KB per person. We present a detailed analysis of the modalities and how they are integrated to create an efficient and effective solution for 1:N search (de‑duplication). Evaluations on a demographically stratified gallery of 220 million identities, randomly sampled from 1.55 billion records in India's Aadhaar database, yield an FNIR of 0.3% at an FPIR of 0.5%, for adult probes (over 18 years). We also compare the performance of Bharat ABIS against three state‑of‑the‑art COTS systems on a 20M gallery. Our system achieves a throughput of 100 searches per second on a gallery of 40M on a single server (8xNvidia H100 GPUs, 2TB RAM).

Abstract:
Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio‑focused approaches still treat spectrograms as generic images and do not explicitly exploit their time‑frequency structure. We propose Q‑Patch, a quantum feature map tailored to audio that encodes local time‑frequency patches from mel‑spectrograms into quantum states using shallow, hardware‑efficient circuits with adjacency‑aware entanglement. Each selected patch is summarized by a compact four‑dimensional acoustic descriptor and mapped to a four‑qubit circuit with depth at most three, enabling practical quantum kernel construction under near‑term constraints. We evaluate Q‑Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size‑matched classical baselines. Q‑Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF‑SVM) trained on the same patch‑level features. Kernel‑space analysis further reveals a clear class structure, with cross‑class similarity around 0.615 and within‑class self‑similarity of 1.00. Overall, Q‑Patch provides a practical framework for incorporating time‑frequency‑aware representations into quantum kernel learning for audio authenticity assessment in low‑resource settings.

Abstract:
Mobile remote identity verification (RIdV) systems are exposed to attacks that manipulate or replace the facial video stream, including presentation attacks, real‑time deepfakes, and video injection. Recent European requirements, including ETSI TS 119 461 and CEN/TS 18099, motivate complementary evidence channels beyond camera‑based presentation‑attack detection. This paper investigates whether passive motion traces recorded during selfie capture provide auxiliary evidence for spoof screening and user verification. We introduce CanSelfie, a dataset of 375 bona fide multi‑sensor sequences collected at 50 Hz from 30 participants using a commercial mobile RIdV application, together with stationary, handheld, and temporally shifted attack‑proxy scenarios. We benchmark 7 multivariate time‑series classifiers and 8 whole‑series anomaly detectors across sensor configurations and temporal windows. For spoof screening, accelerometer‑only ROCKAD obtains 0.00% false rejection rate (FRR) and 43.8% false acceptance rate (FAR), while QUANT+3‑NN obtains the lowest overall FAR of 32.0% at 2.37% FRR; both reject all stationary attack proxies. For same‑device and same‑session user verification, WEASEL+MUSE reaches 1.07% equal error rate (EER) using 9 sensor channels. The analysis shows that raw accelerometer data, preserving gravity and orientation cues, is the most informative modality, and that closed‑set classification accuracy alone does not imply good verification performance because threshold calibration depends on score distributions. The findings suggest that short selfie‑capture motion traces contain measurable spoof‑related and identity‑related information, supporting their use as a low‑friction auxiliary signal while also identifying the need for cross‑device, cross‑session, and real injection‑attack evaluation.

Abstract:
Despite significant advances in facial recognition systems, they remain vulnerable to face presentation attacks. Among them, disguise makeup attacks are particularly challenging, as they use advanced cosmetics, prosthetic components, and artificial materials to realistically alter facial appearance, often making detection difficult even for humans. Although important, this problem remains underexplored, and publicly available datasets are limited. To address this, we propose a generalized disguise makeup presentation attack detection framework. The method adopts a two‑phase design in which a style‑invariant full‑face model, trained with metric learning and enhanced by a whitening transformation, extracts region attention scores via Grad‑CAM. These scores guide a patch‑based phase that performs localized analysis using region‑specific subnetworks trained with metric learning for fine‑grained discrimination. We also construct a new, diverse dataset of live and disguise makeup faces collected under real‑world conditions, covering variations in subjects, environments, and disguise materials. Experimental results demonstrate strong generalization across both the collected dataset and SiW‑Mv2, achieving 8.97% ACER and 9.76% EER on the collected dataset, and 0% ACER on Obfuscation and Impersonation and 1.34% on Cosmetics attacks of SiW‑Mv2. The proposed method consistently outperforms prior works while maintaining robust performance across other spoof types.

Abstract:
Face recognition systems are often used for biometric authentication. Nevertheless, it is known that without any protective measures, face recognition systems are vulnerable to presentation attacks. To tackle this security problem, methods for detecting presentation attacks have been developed and have shown good detection performance on several benchmark datasets. However, generalising presentation attack detection methods to novel types of attacks is an ongoing challenge. In this work, we employ 1,608 T‑shirt attacks of the T‑shirt Face Presentation Attack (TFPA) database using 100 unique presentation attack instruments together with 152 bona fide presentations. In a comprehensive evaluation, we show that this type of attack can compromise the security of face recognition systems. Furthermore, we propose a detection method based on spatial consistency checks in order to detect said T‑shirt attacks. Precisely, state‑of‑the‑art face and person detectors are combined to analyse the spatial positions of detected faces and persons based on which T‑shirt attacks can be reliably detected.
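
One way to picture a spatial consistency check is to ask whether a detected face sits where a head should be inside the detected person box. The rule below is purely an assumed illustration: the abstract only says that the spatial positions of detected faces and persons are analysed, so the `head_fraction` threshold, the box format, and the function names are all hypothetical.

```python
def is_printed_face_candidate(face_box, person_box, head_fraction=0.3):
    """Return True if the face centre lies inside the person box but below
    the top `head_fraction` of its height (i.e. in the torso or lower).

    Boxes are (x_min, y_min, x_max, y_max) in pixels, with y growing downwards.
    """
    cx = (face_box[0] + face_box[2]) / 2
    cy = (face_box[1] + face_box[3]) / 2
    px_min, py_min, px_max, py_max = person_box
    inside = px_min <= cx <= px_max and py_min <= cy <= py_max
    head_limit = py_min + head_fraction * (py_max - py_min)
    return inside and cy > head_limit


def flag_tshirt_attack(face_boxes, person_box):
    """Flag the person if any detected face falls in the torso region."""
    return any(is_printed_face_candidate(f, person_box) for f in face_boxes)


person = (100, 50, 300, 650)          # hypothetical person detection
real_face = (170, 70, 230, 140)       # centre near the top of the person box
shirt_face = (160, 300, 240, 400)     # centre in the torso region
bona_fide_flag = flag_tshirt_attack([real_face], person)
attack_flag = flag_tshirt_attack([real_face, shirt_face], person)
```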

Abstract:
Face Anti‑Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing it to generate images with diverse artifacts. We further incorporate patch‑based and multi‑task learning to tackle partial attacks and overfitting to facial features. Our extensive experiments validate PCGAN's effectiveness in domain generalization and detecting partial attacks, giving a substantial improvement in facial recognition security.

Abstract:
The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre‑trained multimodal models, such as Paligemma, Llava, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect presentation attacks on ID Documents.

Abstract:
Embedding high‑dimensional data into resource‑limited quantum devices remains a significant challenge for practical quantum machine learning. In multimodal face anti‑spoofing, while linear compression methods such as principal component analysis can reduce dimensionality to accommodate limited quantum budgets, such approaches often lose critical high‑order cross‑modal correlations due to the loss of structural information. To this end, we propose a hybrid Matrix Product State (MPS)‑Variational Quantum Circuit (VQC) framework, where the MPS serves as a structured, differentiable pre‑quantum compression and fusion module, and the VQC acts as the quantum classifier. Built upon the low‑rank structure controlled by the virtual bond dimension and integrated with a configurable nonlinear enhancement mechanism, this MPS module explicitly models long‑range cross‑modal correlations while compressing multimodal data into a compact representation matching the quantum budget and improving numerical stability under extreme compression. Experiments on the CASIA‑SURF benchmark demonstrate that MPS‑VQC achieves accuracy comparable to strong classical neural network baselines with fewer than 0.25M parameters, highlighting the parameter efficiency of tensor‑network representations for high‑dimensional multimodal data under tight resource budgets. Leveraging the intrinsic compatibility between MPS structures and quantum circuit topology, this framework not only provides a viable technological pathway for efficient multimodal anti‑spoofing on NISQ devices but also serves as a stepping stone toward fully quantum implementations of such tasks in the future.

Abstract:
The performance of speech spoofing detection often varies across different training and evaluation corpora. Leveraging multiple corpora typically enhances robustness and performance in fields like speaker recognition and speech recognition. However, our spoofing detection experiments show that multi‑corpus training does not consistently improve performance and may even degrade it. We hypothesize that dataset‑specific biases impair generalization, leading to performance instability. To address this, we propose an Invariant Domain Feature Extraction (IDFE) framework, employing multi‑task learning and a gradient reversal layer to minimize corpus‑specific information in learned embeddings. The IDFE framework reduces the average equal error rate by 20% compared to the baseline, assessed across four varied datasets.
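
The gradient reversal layer mentioned above can be illustrated with a toy, hand-differentiated example. This is a minimal sketch and not the IDFE implementation: the scalar encoder `w`, the corpus-classifier weight `v`, the squared-error corpus loss, and the function names are all assumptions made for illustration. The layer is the identity in the forward pass and multiplies the gradient by `-lam` in the backward pass, so the encoder is pushed to increase the corpus classifier's loss.

```python
def grl_forward(features):
    """Gradient reversal layer, forward pass: identity."""
    return list(features)


def grl_backward(upstream_grad, lam=1.0):
    """Gradient reversal layer, backward pass: flip the sign and scale by lam."""
    return [-lam * g for g in upstream_grad]


def encoder_grad_through_grl(w, v, x, corpus_label, lam=1.0):
    """Gradient of a toy corpus loss w.r.t. the encoder weight w, seen through the GRL.

    Toy model (hypothetical): z = w * x, corpus loss = (v * z - corpus_label) ** 2.
    """
    z = w * x
    (z_out,) = grl_forward([z])
    dloss_dz = 2.0 * (v * z_out - corpus_label) * v
    (dz_reversed,) = grl_backward([dloss_dz], lam)
    return dz_reversed * x  # chain rule: dz/dw = x


# Without the GRL the encoder gradient would be +4.0; with it the sign is flipped.
grad_w = encoder_grad_through_grl(w=0.5, v=1.0, x=2.0, corpus_label=0.0, lam=1.0)
```

In an autograd framework the same idea is normally written as a custom backward function; the scalar version only shows the sign flip that removes corpus-specific information from the shared embedding.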

Abstract:
Human perceptual priors have shown promise in saliency‑guided deep learning training, particularly in the domain of iris presentation attack detection (PAD). Common saliency approaches include hand annotations obtained via mouse clicks and eye gaze heatmaps derived from eye tracking data. However, the most effective form of human saliency for open‑set iris PAD remains under‑explored. In this paper, we conduct a series of experiments comparing hand annotations, eye tracking heatmaps, segmentation masks, and foundation model embeddings to a state‑of‑the‑art deep learning‑based baseline on the task of open‑set iris PAD. Results for open‑set PAD in a leave‑one‑attack‑type out paradigm indicate that denoised eye tracking heatmaps show the best generalization improvement over cross entropy in Attack Presentation Classification Error Rate (APCER) at Bona Fide Presentation Classification Error Rate (BPCER) of 1%. Along with this paper, we offer trained models, code, and saliency maps for reproducibility and to facilitate follow‑up research efforts.
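
The operating point "APCER at a BPCER of 1%" can be computed from raw scores roughly as follows. This is a sketch under stated assumptions, not the paper's evaluation code: scores are assumed to be attack-likeness values (higher means more likely an attack), a sample is flagged as an attack when its score exceeds the threshold, and `apcer_at_bpcer` is a hypothetical helper name.

```python
def apcer_at_bpcer(bona_fide_scores, attack_scores, target_bpcer=0.01):
    """Return (APCER, BPCER, threshold) at the lowest threshold meeting the BPCER target.

    A sample is classified as an attack when score > threshold.
    BPCER: fraction of bona fide samples flagged as attacks.
    APCER: fraction of attack samples accepted as bona fide.
    """
    candidates = [float("-inf")] + sorted(set(bona_fide_scores))
    for thr in candidates:
        bpcer = sum(s > thr for s in bona_fide_scores) / len(bona_fide_scores)
        if bpcer <= target_bpcer:
            apcer = sum(s <= thr for s in attack_scores) / len(attack_scores)
            return apcer, bpcer, thr
    raise ValueError("unreachable: the largest bona fide score always gives BPCER = 0")


bona_fide = [0.1, 0.2, 0.3, 0.4]
attacks = [0.35, 0.5, 0.9, 0.05]
result = apcer_at_bpcer(bona_fide, attacks, target_bpcer=0.25)
```

With only a handful of bona fide samples a 1% target collapses to BPCER = 0, so at least 100 bona fide scores are needed for the 1% operating point to be distinguishable.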

Abstract:
Contactless fingerprint recognition enables hygienic and convenient biometric authentication but poses new challenges for spoof detection due to the absence of physical contact and traditional liveness cues. Most existing methods rely on single‑image acquisition and appearance‑based features, which often generalize poorly across devices, capture conditions, and spoof materials. In this work, we study paired flash‑non‑flash contactless fingerprint acquisition as a lightweight active sensing mechanism for spoof detection. Through a preliminary empirical analysis, we show that flash illumination accentuates material‑ and structure‑dependent properties, including ridge visibility, subsurface scattering, micro‑geometry, and surface oils, while non‑flash images provide a baseline appearance context. We analyze lighting‑induced differences using interpretable metrics such as inter‑channel correlation, specular reflection characteristics, texture realism, and differential imaging. These complementary features help discriminate genuine fingerprints from printed, digital, and molded presentation attacks. We further examine the limitations of paired acquisition, including sensitivity to imaging settings, dataset scale, and emerging high‑fidelity spoofs. Our findings demonstrate the potential of illumination‑aware analysis to improve robustness and interpretability in contactless fingerprint presentation attack detection, motivating future work on paired acquisition and physics‑informed feature design. Code is available in the repository.

Abstract:
Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting diverse‑enough data, yet still limited in terms of its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Because the rapid emergence of new attack vectors demands adaptable solutions, we investigate whether general‑purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre‑trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB‑restricted dataset of 224 iris images spanning seven attack types, using only university‑approved services (Gemini 2.5 Pro) or locally‑hosted models (e.g., Llama 3.2‑Vision), we show that Gemini with expert‑informed prompts outperforms both a specialized convolutional neural network (CNN)‑based baseline and human examiners, while the locally‑deployable Llama achieves near‑human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.

Abstract:
Logical Access (LA) attacks, also known as audio deepfake attacks, use Text‑to‑Speech (TTS) or Voice Conversion (VC) methods to generate spoofed speech data. This can represent a serious threat to Automatic Speaker Verification (ASV) systems, as intruders can use such attacks to bypass voice biometric security. In this study, we investigate the correlation between speech quality and the performance of audio spoofing detection systems (i.e., LA task). For that, the performance of two enhancement algorithms is evaluated based on two perceptual speech quality measures, namely Perceptual Evaluation of Speech Quality (PESQ) and Speech‑to‑Reverberation Modulation Ratio (SRMR), and with respect to their impact on the audio spoofing detection system. We adopted the LA dataset, provided in the ASVspoof 2019 Challenge, and corrupted its test set with noise at different Signal‑to‑Noise Ratio (SNR) levels, while leaving the training data untouched. Enhancement was applied to attenuate the detrimental effects of noisy speech, and the performances of two models, Speech Enhancement Generative Adversarial Network (SEGAN) and Metric‑Optimized Generative Adversarial Network Plus (MetricGAN+), were compared. Although we expect that speech quality will correlate well with speech applications' performance, enhancement can also have a side effect on downstream tasks if unwanted artifacts are introduced or relevant information is removed from the speech signal. Our results corroborate this hypothesis, as we found that the enhancement algorithm leading to the highest speech quality scores, MetricGAN+, provided the highest Equal Error Rate (EER) on the audio spoofing detection task, whereas the enhancement method with the lowest speech quality scores, SEGAN, led to the lowest EER, thus leading to better performance on the LA task.
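
Corrupting a test utterance at a chosen SNR can be sketched as below. The abstract does not state the noise type or the tooling, so white Gaussian noise, the fixed seed, and the helper names `add_noise_at_snr` and `measured_snr_db` are assumptions for illustration only. The noise is scaled so that 10·log10(P_signal / P_noise) equals the requested value.

```python
import math
import random


def _power(samples):
    """Mean squared amplitude of a sample sequence."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return (noisy_signal, scaled_noise) with the requested SNR in dB.

    Assumption: white Gaussian noise; real experiments may use recorded noise.
    """
    rng = rng or random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = _power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / _power(noise))
    scaled = [scale * n for n in noise]
    return [s + n for s, n in zip(signal, scaled)], scaled


def measured_snr_db(signal, noise):
    """SNR in dB between a clean signal and an additive noise sequence."""
    return 10.0 * math.log10(_power(signal) / _power(noise))


# A 440 Hz tone sampled at 16 kHz stands in for a speech utterance.
clean = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(1600)]
noisy, noise = add_noise_at_snr(clean, snr_db=10.0)
```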

Abstract:
Audio anti‑spoofing systems are typically formulated as binary classifiers distinguishing bona fide from spoofed speech. This assumption fails under layered generative processing, where benign transformations introduce distributional shifts that are misclassified as spoofing. We show that phonation‑modifying voice conversion and speech restoration are treated as out‑of‑distribution despite preserving speaker authenticity. Using a multi‑class setup separating bona fide, converted, spoofed, and converted‑spoofed speech, we analyse model behaviour through self‑supervised learning (SSL) embeddings and acoustic correlates. The benign transformations induce a drift in the SSL space, compressing bona fide and spoofed speech and reducing classifier separability. Reformulating anti‑spoofing as a multi‑class problem improves robustness to benign shifts while preserving spoof detection, suggesting binary systems model the distribution of raw speech rather than authenticity itself.

Abstract:
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV‑VASM, a probabilistic framework for verifying the robustness of voice anti‑spoofing models (VASMs). PV‑VASM estimates the probability of misclassification under text‑to‑speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model‑agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.

Abstract:
Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti‑Spoofing (FAS) solutions. Recent MLLM‑based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross‑domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine‑grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool‑Augmented Reasoning FAS (TAR‑FAS) framework, which reformulates the FAS task as a Chain‑of‑Thought with Visual Tools (CoT‑VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine‑grained investigation. To this end, we design a tool‑augmented data annotation pipeline and construct the ToolFAS‑16K dataset, which contains multi‑turn tool‑use reasoning trajectories. Furthermore, we introduce a tool‑aware FAS training pipeline, where Diverse‑Tool Group Relative Policy Optimization (DT‑GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one‑to‑eleven cross‑domain protocol demonstrate that TAR‑FAS achieves SOTA performance while providing fine‑grained visual investigation for trustworthy spoof detection.

Abstract:
Spoofing detection systems are typically trained using diverse recordings from multiple speakers, often assuming that the resulting embeddings are independent of speaker identity. However, this assumption remains unverified. In this paper, we investigate the impact of speaker information on spoofing detection systems. We propose two approaches within our Speaker‑Invariant Multi‑Task (SInMT) framework, one that models speaker identity within the embeddings and another that removes it. SInMT integrates multi‑task learning for joint speaker recognition and spoofing detection, incorporating a gradient reversal layer. Evaluated using four datasets, our speaker‑invariant model reduces the average equal error rate by 17% compared to the baseline, with up to 48% reduction for the most challenging attacks (e.g., A11).

Abstract:
Face presentation attack detection (FacePAD) is critical for securing facial authentication against print, replay, and mask‑based spoofing. This paper proposes CASO‑PAD, an RGB‑only, single‑frame model that enhances MobileNetV3 with content‑adaptive spatial operators (involution) to better capture localized spoof cues. Unlike spatially shared convolution kernels, the proposed operator generates location‑specific, channel‑shared kernels conditioned on the input, improving spatial selectivity with minimal overhead. CASO‑PAD remains lightweight (3.6M parameters; 0.64 GFLOPs at 256×256) and is trained end‑to‑end using a standard binary cross‑entropy objective. Extensive experiments on Replay‑Attack, Replay‑Mobile, ROSE‑Youtu, and OULU‑NPU demonstrate strong performance, achieving 100/100/98.9/99.7% test accuracy, AUC of 1.00/1.00/0.9995/0.9999, and HTER of 0.00/0.00/0.82/0.44%, respectively. On the large‑scale SiW‑Mv2 Protocol‑1 benchmark, CASO‑PAD further attains 95.45% accuracy with 3.11% HTER and 3.13% EER, indicating improved robustness under diverse real‑world attacks. Ablation studies show that placing the adaptive operator near the network head and using moderate group sharing yields the best accuracy‑efficiency balance. Overall, CASO‑PAD provides a practical pathway for robust, on‑device FacePAD with mobile‑class compute and without auxiliary sensors or temporal stacks.
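
The phrase "location‑specific, channel‑shared kernels conditioned on the input" can be made concrete with a small pure-Python involution. This is a simplified sketch and not CASO‑PAD's operator: it uses a single linear map `w_gen` to generate the kernel (the published involution design uses a bottleneck of two maps), a single group shared by all channels, zero padding, and hypothetical names throughout.

```python
def involution2d(x, w_gen, k=3):
    """Apply a single-group involution to x[h][w][c].

    For every location (i, j), a k*k kernel is generated from the pixel's own
    feature vector (kernel[t] = sum_c w_gen[t][c] * x[i][j][c]) and then applied
    to the k*k neighbourhood, with the same kernel reused for every channel.
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    r = k // 2
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Location-specific kernel, conditioned on the input at (i, j).
            kernel = [
                sum(w_gen[t][ch] * x[i][j][ch] for ch in range(c))
                for t in range(k * k)
            ]
            for du in range(-r, r + 1):
                for dv in range(-r, r + 1):
                    ii, jj = i + du, j + dv
                    if 0 <= ii < h and 0 <= jj < w:  # zero padding outside the map
                        weight = kernel[(du + r) * k + (dv + r)]
                        for ch in range(c):  # channel-shared: same weight for all channels
                            out[i][j][ch] += weight * x[ii][jj][ch]
    return out


# Toy check: only the centre tap is generated, equal to the sum of the pixel's channels.
x = [[[1, 2], [3, 4]], [[0, 1], [2, 0]]]
w_gen = [[0, 0]] * 4 + [[1, 1]] + [[0, 0]] * 4
y = involution2d(x, w_gen, k=3)
```

A convolution would apply one fixed kernel at every location and a different one per channel; the sketch inverts both properties, which is the contrast the abstract draws.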

Abstract:
Multi‑branch deep neural networks like AASIST3 achieve performance comparable to the state of the art in audio anti‑spoofing, yet their internal decision dynamics remain opaque compared to traditional input‑level saliency methods. While existing interpretability efforts largely focus on visualizing input artifacts, the way individual architectural branches cooperate or compete under different spoofing attacks is not well characterized. This paper develops a framework for interpreting AASIST3 at the component level. Intermediate activations from fourteen branches and global attention modules are modeled with covariance operators whose leading eigenvalues form low‑dimensional spectral signatures. These signatures train a CatBoost meta‑classifier to generate TreeSHAP‑based branch attributions, which we convert into normalized contribution shares and confidence scores (Cb) to quantify the model's operational strategy. By analyzing 13 spoofing attacks from the ASVspoof 2019 benchmark, we identify four operational archetypes, ranging from Effective Specialization (e.g., A09, Equal Error Rate (EER) 0.04%, C=1.56) to Ineffective Consensus (e.g., A08, EER 3.14%, C=0.33). Crucially, our analysis exposes a Flawed Specialization mode where the model places high confidence in an incorrect branch, leading to severe performance degradation for attacks A17 and A18 (EER 14.26% and 28.63%, respectively). These quantitative findings link internal architectural strategy directly to empirical reliability, highlighting specific structural dependencies that standard performance metrics overlook.

Abstract:
Large audio‑language models (LALMs) exhibit strong zero‑shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class‑Conditional Sparse Attention Vectors for Large Audio‑Language Models, a few‑shot classification method that learns class‑dependent importance weights over attention heads. This formulation allows individual heads to specialize in distinct semantic categories and to contribute to ensemble predictions proportionally to their estimated reliability. Experiments on multiple few‑shot audio and audiovisual classification benchmarks and tasks demonstrate that our method consistently outperforms state‑of‑the‑art uniform voting‑based approaches, with absolute gains of up to 14.52%, 1.53%, and 8.35% for audio classification, audio‑visual classification, and spoofing detection, respectively.
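
The difference between uniform voting and class-dependent head weights can be sketched as follows. This is an illustration under assumptions, not the paper's method: each selected head is assumed to emit a hard class prediction, the weight table `weights[h][c]` is taken as given (the paper learns it from few-shot data in a way not described here), and both function names are hypothetical.

```python
def uniform_vote(head_preds, num_classes):
    """Every head casts one vote of equal weight for its predicted class."""
    scores = [0.0] * num_classes
    for pred in head_preds:
        scores[pred] += 1.0
    return max(range(num_classes), key=lambda c: scores[c])


def class_conditional_vote(head_preds, weights, num_classes):
    """Head h votes for its predicted class c with weight weights[h][c].

    weights[h][c] stands for how reliable head h is when it predicts class c.
    """
    scores = [0.0] * num_classes
    for h, pred in enumerate(head_preds):
        scores[pred] += weights[h][pred]
    return max(range(num_classes), key=lambda c: scores[c])


# Three heads, two classes (0 = bona fide, 1 = spoof in this toy example).
head_preds = [0, 0, 1]
weights = [[0.2, 0.9], [0.3, 0.8], [0.5, 0.95]]
uniform_choice = uniform_vote(head_preds, num_classes=2)
weighted_choice = class_conditional_vote(head_preds, weights, num_classes=2)
```

In this toy case two heads that are unreliable for class 0 outvote one head that is reliable for class 1 under uniform voting, while the class-conditional weights reverse the decision.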

Abstract:
Recent advances in speech synthesis and editing have made speech spoofing increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semantic effects. In this paper, we introduce HoliAntiSpoof, the first audio large language model (ALLM) framework for holistic speech anti‑spoofing analysis. HoliAntiSpoof reformulates spoofing analysis as a unified text generation task, enabling joint reasoning over spoofing methods, affected speech attributes, and their semantic impacts. To support semantic‑level analysis, we introduce DailyTalkEdit, a new anti‑spoofing benchmark that simulates realistic conversational manipulations and provides annotations of semantic influence. Extensive experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across multiple settings, while preliminary results show that in‑context learning further improves out‑of‑domain generalization. These findings indicate that ALLMs not only enhance speech spoofing detection performance but also enable interpretable analysis of spoofing behaviors and their semantic effects, pointing towards more trustworthy and explainable speech security. Data and code are publicly available.

Abstract:
Presentation attack detection (PAD) subsystems are an important part of effective and user‑friendly remote identity validation (RIV) systems. However, ensuring robust performance across diverse environmental and procedural conditions remains a critical challenge. This paper investigates the impact of low‑light conditions and automated image acquisition on the robustness of commercial PAD systems using a scenario test of RIV. Our results show that PAD systems experience a significant decline in performance when utilized in low‑light or auto‑capture scenarios, with a model‑predicted increase in error rates by a factor of about four under low‑light conditions and a doubling of those odds under auto‑capture workflows. Specifically, only one of the tested systems was robust to these perturbations, maintaining a maximum bona fide presentation classification error rate below 3% across all scenarios. Our findings emphasize the importance of testing across diverse environments to ensure robust and reliable PAD performance in real‑world applications.

Abstract:
The widespread adoption of deep‑learning models in data‑driven applications has drawn attention to the potential risks associated with biased datasets and models. Neglected or hidden biases within datasets and models can lead to unexpected results. This study addresses the challenges of dataset bias and explores "shortcut learning" or "Clever Hans effect" in binary classifiers. We propose a novel framework for analyzing the black‑box classifiers and for examining the impact of both training and test data on classifier scores. Our framework incorporates intervention and observational perspectives, employing a linear mixed‑effects model for post‑hoc analysis. By evaluating classifier performance beyond error rates, we aim to provide insights into biased datasets and offer a comprehensive understanding of their influence on classifier behavior. The effectiveness of our approach is demonstrated through experiments on audio anti‑spoofing and speaker verification tasks using both statistical models and deep neural networks. The insights gained from this study have broader implications for tackling biases in other domains and advancing the field of explainable artificial intelligence.

Abstract:
The development of robust, multilingual speaker recognition systems is hindered by a lack of large‑scale, publicly available and multilingual datasets, particularly for the read‑speech style crucial for applications like anti‑spoofing. To address this gap, we introduce the TidyVoice dataset derived from the Mozilla Common Voice corpus after mitigating its inherent speaker heterogeneity within the provided client IDs. TidyVoice currently contains training and test data from over 212,000 monolingual speakers (Tidy‑M) and around 4,500 multilingual speakers (Tidy‑X) from which we derive two distinct conditions. The Tidy‑M condition contains target and non‑target trials from monolingual speakers across 81 languages. The Tidy‑X condition contains target and non‑target trials from multilingual speakers in both same‑ and cross‑language trials. We employ two architectures of ResNet models, achieving a 0.35% EER by fine‑tuning on our comprehensive Tidy‑M partition. Moreover, we show that this fine‑tuning enhances the model's generalization, improving performance on unseen conversational interview data from the CANDOR corpus. The complete dataset, evaluation trials, and our models are publicly released to provide a new resource for the community.

Abstract:
Audio Deepfake Detection (ADD) aims to distinguish spoofed speech from bona fide speech. Most prior studies assume that stronger correlations within or across acoustic and emotional features imply authenticity, and thus focus on enhancing or measuring such correlations. However, existing methods often treat acoustic and emotional features in isolation or rely on correlation metrics, which overlook subtle desynchronization between them and smooth out abrupt discontinuities. To address these issues, we propose EAI‑ADD, which treats cross‑level emotion‑acoustic inconsistency as the primary detection signal. We first project emotional and acoustic representations into a comparable space. Then we progressively integrate frame‑level and utterance‑level emotion features with acoustic features to capture cross‑level emotion‑acoustic inconsistencies across different temporal granularities. Experimental results on the ASVspoof 2019 LA and 2021 LA datasets demonstrate that the proposed EAI‑ADD outperforms baselines, providing a more effective solution for audio anti‑spoofing.

Abstract:
In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade‑off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating‑point operations per second (FLOP/s), are non‑differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD‑based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance‑complexity trade‑off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real‑world applications: voice activity detection and audio anti‑spoofing. The code related to our work is publicly available to encourage further research.

Abstract:
Speaker‑specific anti‑spoofing and synthesis‑source tracing are central challenges in audio anti‑spoofing. Progress has been hampered by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. To address this gap, we introduce LJ‑Spoof, a speaker‑specific, generatively diverse corpus that systematically varies prosody, vocoders, generative hyperparameters, bona fide prompt sources, training regimes, and neural post‑processing. The corpus spans one speaker (including studio‑quality recordings), 30 TTS families, 500 generatively variant subsets, 10 bona fide neural‑processing variants, and more than 3 million utterances. This variation‑dense design enables robust speaker‑conditioned anti‑spoofing and fine‑grained synthesis‑source tracing. We further position this dataset as both a practical reference training resource and a benchmark evaluation suite for anti‑spoofing and source tracing.

Abstract:
Audio recorded in real‑world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text‑to‑speech, voice conversion, and other generation models, either component can now be modified independently. Such component‑level manipulations are harder to detect, as the remaining unaltered component can mislead systems designed for fully deepfaked audio, and they often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation‑enhanced joint learning framework. CompSpoofV2 is a large‑scale curated dataset designed for component‑level audio anti‑spoofing, which contains over 250k audio samples, with a total duration of approximately 283 hours. Based on CompSpoofV2 and the separation‑enhanced joint learning framework, we launch the Environment‑Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing on component‑level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).

Abstract:
As audio deepfakes transition from research artifacts to widely available commercial tools, robust biometric authentication faces pressing security threats in high‑stakes industries. This paper presents a systematic empirical evaluation of state‑of‑the‑art speaker authentication systems based on a large‑scale speech synthesis dataset, revealing two major security vulnerabilities: 1) modern voice cloning models trained on very small samples can easily bypass commercial speaker verification systems; and 2) anti‑spoofing detectors struggle to generalize across different methods of audio synthesis, leading to a significant gap between in‑domain performance and real‑world robustness. These findings call for a reconsideration of security measures and stress the need for architectural innovations, adaptive defenses, and the transition towards multi‑factor authentication.

Abstract:
Autonomous Vehicles (AVs) refer to systems capable of perceiving their states and moving without human intervention. Among the factors required for autonomous decision‑making in mobility, positional awareness of the vehicle itself is the most critical. Accordingly, extensive research has been conducted on defense mechanisms against GPS spoofing attacks, which threaten AVs by disrupting position recognition. Among these, detection methods based on internal IMU sensors are regarded as some of the most effective. In this paper, we propose a spoofing attack system designed to neutralize IMU sensor‑based detection. First, we present an attack modeling approach for bypassing such detection. Then, based on EKF sensor fusion, we experimentally analyze both the impact of GPS spoofing values on the internal target system and how our proposed methodology reduces anomaly detection within the target system. To this end, this paper proposes an attack model that performs GPS spoofing by stealing internal dynamic state information using an external IMU sensor, and the experimental results demonstrate that attack values can be injected without being detected.
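
The kind of IMU‑based consistency check that such an attack targets can be sketched with a one-dimensional filter. This is a defensive toy model under assumptions, not the paper's system: the state is a scalar position, the motion model is linear (so the EKF reduces to an ordinary Kalman filter), `imu_velocity` stands in for IMU‑derived motion, the noise values `q` and `r` are arbitrary, and the gate 6.63 is the 99% chi-square quantile for one degree of freedom.

```python
def gps_consistency_step(z_gps, imu_velocity, dt, x, p, q=0.1, r=4.0, gate=6.63):
    """One predict/check/update step; returns (x, p, nis, alarm).

    x, p : position estimate and its variance
    q, r : process and GPS measurement noise variances (assumed values)
    nis  : normalized innovation squared, compared against the gate
    """
    # Predict from IMU-derived motion.
    x_pred = x + imu_velocity * dt
    p_pred = p + q
    # Compare the GPS fix with the prediction.
    innovation = z_gps - x_pred
    s = p_pred + r
    nis = innovation * innovation / s
    if nis > gate:
        # GPS disagrees with inertial prediction: raise an alarm, skip the update.
        return x_pred, p_pred, nis, True
    gain = p_pred / s
    return x_pred + gain * innovation, (1.0 - gain) * p_pred, nis, False


# Vehicle moves at 1 m/s; GPS is truthful for 5 steps, then jumps by 50 m.
state, cov = 0.0, 1.0
alarms = []
for t in range(1, 7):
    z = float(t) if t < 6 else float(t) + 50.0
    state, cov, nis, alarm = gps_consistency_step(z, 1.0, 1.0, state, cov)
    alarms.append(alarm)
```

An abrupt jump is caught by the gate, whereas offsets that keep the innovation small relative to `s` are not flagged, which is the weakness the abstract's attack model exploits.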

Abstract:
Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal‑free IL (RF‑IL). Vision‑Language Pre‑trained (VLP) models, with their prompt‑tunable cross‑modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose SVLP‑IL, a VLP‑based RF‑IL framework that balances stability and plasticity via Multi‑Aspect Prompting (MAP) and Selective Elastic Weight Consolidation (SEWC). MAP isolates domain dependencies, enhances distribution‑shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain‑specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP‑IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP‑IL offers a privacy‑compliant, practical solution for robust lifelong PAD deployment in RF‑IL settings.
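
For orientation, the standard Elastic Weight Consolidation penalty that SEWC builds on can be written as below. The mask $m_i \in \{0, 1\}$ is a hypothetical way to express "selective" preservation and is an assumption, since the abstract does not define SEWC's selection rule; setting $m_i = 1$ for all $i$ recovers standard EWC. Here $\theta^{*}$ denotes the weights after the previous task, $F_i$ the diagonal Fisher information estimate for weight $i$, and $\lambda$ the regularization strength.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{new}}(\theta)
  \;+\; \frac{\lambda}{2} \sum_{i} m_i \, F_i \, \bigl(\theta_i - \theta_i^{*}\bigr)^{2}
```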

Abstract:
Face anti‑spoofing (FAS) is a vital component of remote biometric authentication systems based on facial recognition, increasingly used across web‑based applications. Among emerging threats, video injection attacks, facilitated by technologies such as deepfakes and virtual camera software, pose significant challenges to system integrity. While virtual camera detection (VCD) has shown potential as a countermeasure, existing literature offers limited insight into its practical implementation and evaluation. This study introduces a machine learning‑based approach to VCD, with a focus on its design and validation. The model is trained on metadata collected during sessions with authentic users. Empirical results demonstrate its effectiveness in identifying video injection attempts and reducing the risk of malicious users bypassing FAS systems.

Abstract:
Iris recognition is widely recognized as one of the most accurate biometric modalities. However, its growing deployment in real‑world applications raises significant concerns regarding its vulnerability to Presentation Attacks (PAs). Effective Presentation Attack Detection (PAD) is therefore critical to ensure the integrity and security of iris‑based biometric systems. While conventional iris recognition systems predominantly operate in the near‑infrared (NIR) spectrum, multispectral imaging across multiple NIR bands provides complementary reflectance information that can enhance the generalizability of PAD methods. In this work, we propose SpectraIrisPAD, a novel deep learning‑based framework for robust multispectral iris PAD. SpectraIrisPAD leverages a DINOv2 Vision Transformer (ViT) backbone equipped with learnable spectral positional encoding, token fusion, and contrastive learning to extract discriminative, band‑specific features that effectively distinguish bona fide samples from various spoofing artifacts. Furthermore, we introduce a new comprehensive dataset, Multispectral Iris PAD (MSIrPAD), with diverse PAIs, captured using a custom‑designed multispectral iris sensor operating at five distinct NIR wavelengths (800 nm, 830 nm, 850 nm, 870 nm, and 980 nm). The dataset includes 18,848 iris images encompassing eight diverse PAI categories, including five textured contact lenses, print attacks, and display‑based attacks. We conduct comprehensive experiments under unseen attack evaluation protocols to assess the generalization capability of the proposed method. SpectraIrisPAD consistently outperforms several state‑of‑the‑art baselines across all performance metrics, demonstrating superior robustness and generalizability in detecting a wide range of presentation attacks.

Abstract:
Face anti‑spoofing (FAS) has recently advanced in multimodal fusion, cross‑domain generalization, and interpretability. With large language models and reinforcement learning (RL), strategy‑based training offers new opportunities to jointly model these aspects. However, multimodal reasoning is more complex than unimodal reasoning, requiring accurate feature representation and cross‑modal verification while facing scarce, high‑quality annotations, which makes direct application of RL sub‑optimal. We identify two key limitations of supervised fine‑tuning plus RL (SFT+RL) for multimodal FAS: (1) limited multimodal reasoning paths restrict the use of complementary modalities and shrink the exploration space after SFT, weakening the effect of RL; and (2) mismatched single‑task supervision versus diverse reasoning paths causes reasoning confusion, where models may exploit shortcuts by mapping images directly to answers and ignoring the intended reasoning. To address this, we propose PA‑FAS, which enhances reasoning paths by constructing high‑quality extended reasoning sequences from limited annotations, enriching paths and relaxing exploration constraints. We further introduce an answer‑shuffling mechanism during SFT to force comprehensive multimodal analysis instead of using superficial cues, thereby encouraging deeper reasoning and mitigating shortcut learning. PA‑FAS significantly improves multimodal reasoning accuracy and cross‑domain generalization, and better unifies multimodal fusion, generalization, and interpretability for trustworthy FAS.

Abstract:
The rapid growth of quantum computing poses a threat to the cryptographic foundations of digital systems, requiring the development of secure and scalable electronic voting (evoting) frameworks. We introduce a post‑quantum‑secure evoting architecture that integrates Falcon lattice‑based digital signatures, biometric authentication via MobileNetV3 and AdaFace, and a permissioned blockchain for tamper‑proof vote storage. Voter registration involves capturing facial embeddings, which are digitally signed using Falcon and stored on‑chain to ensure integrity and non‑repudiation. During voting, real‑time biometric verification is performed using anti‑spoofing techniques and cosine‑similarity matching. The system demonstrates low latency and robust spoof detection, monitored through Prometheus and Grafana for real‑time auditing. The average classification error rates (ACER) are below 3.5% on the CelebA Spoof dataset and under 8.2% on the Wild Face Anti‑Spoofing (WFAS) dataset. Blockchain anchoring incurs minimal gas overhead, approximately 3.3% for registration and 0.15% for voting, supporting system efficiency, auditability, and transparency. The experimental results confirm the system's scalability, efficiency, and resilience under concurrent loads. This approach offers a unified solution to address key challenges in voter authentication, data integrity, and quantum‑resilient security for digital systems.
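
The verification step and the reported metric can be illustrated with a minimal Python sketch. It assumes embeddings are plain lists of floats and uses the definition of ACER common in the FAS literature (the mean of APCER and BPCER). The function names and the 0.5 threshold are hypothetical and do not come from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def is_match(enrolled, probe, threshold=0.5):
    """Accept the probe if it is close enough to the enrolled embedding.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return cosine_similarity(enrolled, probe) >= threshold

def acer(apcer, bpcer):
    """Average classification error rate: mean of APCER and BPCER."""
    return (apcer + bpcer) / 2.0
```

In practice the threshold would be tuned on held-out data to balance false accepts against false rejects.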

Abstract:
Multimodal Face Anti‑Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross‑domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain‑specific inter‑modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self‑supervised task enhancing intrinsic, generalizable modal features via cross‑sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state‑of‑the‑art cross‑domain performance.

Abstract:
Developing reliable iris recognition and presentation attack detection methods requires diverse datasets that capture realistic variations in iris features and a wide spectrum of anomalies. Because of the rich texture of iris images, which spans a wide range of spatial frequencies, synthesizing same‑identity iris images while controlling specific attributes remains challenging. In this work, we introduce a new iris image augmentation strategy by traversing a generative model's latent space toward latent codes that represent same‑identity samples but with some desired iris image properties manipulated. The latent space traversal is guided by a gradient of specific geometrical, textural, or quality‑related iris image features (e.g., sharpness, pupil size, iris size, or pupil‑to‑iris ratio) and preserves the identity represented by the image being manipulated. The proposed approach can be easily extended to manipulate any attribute for which a differentiable loss term can be formulated. Additionally, our approach can operate on either images randomly generated by a pre‑trained GAN model or real‑world iris images, since GAN inversion can project any given iris image into the latent space to obtain its corresponding latent code.

Abstract:
The increasing virtualization of fifth generation (5G) networks expands the attack surface of the user plane, making spoofing a persistent threat to slice integrity and service reliability. This study presents a slice‑aware lightweight machine‑learning framework for detecting spoofing attacks within 5G network slices. The framework was implemented on a reproducible Open5GS and srsRAN testbed emulating three service classes, namely enhanced Mobile Broadband (eMBB), Ultra‑Reliable Low‑Latency Communication (URLLC), and massive Machine‑Type Communication (mMTC), under controlled benign and adversarial traffic. Two efficient classifiers, Logistic Regression and Random Forest, were trained independently for each slice using statistical flow features derived from mirrored user‑plane traffic. Slice‑aware training improved detection accuracy by up to 5% and achieved F1‑scores between 0.93 and 0.96 while maintaining real‑time operation on commodity edge hardware. The results demonstrate that aligning security intelligence with slice boundaries enhances detection reliability and preserves operational isolation, enabling practical deployment in 5G network‑security environments. Conceptually, the work bridges network‑security architecture and adaptive machine learning by showing that isolation‑aware intelligence can achieve scalable, privacy‑preserving spoofing defense without high computational cost.

Abstract:
Face recognition systems are increasingly deployed across a wide range of applications, including smartphone authentication, access control, and border security. However, these systems remain vulnerable to presentation attacks (PAs), which can significantly compromise their reliability. In this work, we introduce a new dataset focused on a novel and realistic presentation attack instrument called Nylon Face Masks (NFMs), designed to simulate advanced 3D spoofing scenarios. NFMs are particularly concerning due to their elastic structure and photorealistic appearance, which enable them to closely mimic the victim's facial geometry when worn by an attacker. To reflect real‑world smartphone‑based usage conditions, we collected the dataset using an iPhone 11 Pro, capturing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four distinct presentation scenarios involving both humans and mannequins. We benchmark the dataset using five state‑of‑the‑art PAD methods to evaluate their robustness under unseen attack conditions. The results demonstrate significant performance variability across methods, highlighting the challenges posed by NFMs and underscoring the importance of developing PAD techniques that generalise effectively to emerging spoofing threats.

Abstract:
Remote identity verification is essential for modern digital security; however, it remains highly vulnerable to sophisticated Presentation Attacks (PAs) that utilise forged or manipulated identity documents. Although Deep Learning (DL) has driven advances in Presentation Attack Detection (PAD), the field is fundamentally limited by a lack of data and the poor generalisation of models across various document types and new attack methods. This article presents a systematic literature review (SLR) conducted in accordance with the PRISMA methodology, aiming to comprehensively analyse and synthesise the current state of AI‑based PAD for identity documents from 2020 to 2025. Our analysis reveals a significant methodological evolution: a transition from standard Convolutional Neural Networks (CNNs) to specialised forensic micro‑artefact analysis, and more recently, the adoption of large‑scale Foundation Models (FMs), marking a substantial shift in the field. We identify a central paradox that hinders progress: a critical "Reality Gap" exists between models validated on extensive, private datasets and those assessed using limited public datasets, which typically consist of mock‑ups or synthetic data. This gap limits the reproducibility of research results. Additionally, we highlight a "Synthetic Utility Gap," where synthetic data generation, the primary academic response to data scarcity, often fails to predict forensic utility. This can lead to model overfitting to generation artefacts instead of the actual attack. This review consolidates our findings, identifies critical research gaps, and provides a definitive reference framework that outlines a prescriptive roadmap for future research aimed at developing secure, robust, and globally generalizable PAD systems.

Abstract:
The rapid deployment of unmanned aerial vehicle (UAV) corridors in sixth‑generation (6G) networks requires safe, intelligence‑driven integrated sensing and communications (ISAC). Reconfigurable intelligent surfaces (RIS) enhance spectrum efficiency, localisation accuracy, and situational awareness, while introducing new vulnerabilities. The rise of quantum computing increases the risks associated with harvest‑now‑decrypt‑later strategies and quantum‑enhanced spoofing. We propose a Quantum‑Resilient Threat Modelling (QRTM) framework for RIS‑assisted ISAC in UAV corridors to address these challenges. QRTM integrates classical, quantum‑ready, and quantum‑aided adversaries, countered using post‑quantum cryptographic (PQC) primitives: ML‑KEM for key establishment and Falcon for authentication, both embedded within RIS control signalling and UAV coordination. To strengthen security sensing, the framework introduces RIS‑coded scene watermarking validated through a generalised likelihood ratio test (GLRT), with its detection probability characterised by the Marcum Q function. Furthermore, a Secure ISAC Utility (SIU) jointly optimises secrecy rate, spoofing detection, and throughput under RIS constraints, enabled by a scheduler with computational complexity of O(n^2). Monte Carlo evaluations using 3GPP Release 19 mid‑band urban‑canyon models (7‑15 GHz) demonstrate a spoof‑detection probability approaching 0.99 at a false‑alarm rate of 1e‑3, secrecy‑rate retention exceeding 90 percent against quantum‑capable adversaries, and signal‑interference utilisation improvements of about 25 percent compared with baselines. These results show a standards‑compliant path towards reliable, quantum‑resilient ISAC for UAV corridors in smart cities and non‑terrestrial networks.

Abstract:
We present a data generation framework designed to simulate spoofing attacks and randomly place attack scenarios worldwide. We apply deep neural network‑based models for spoofing detection, utilizing Long Short‑Term Memory networks and Transformer‑inspired architectures. These models are specifically designed for online detection and are trained using the generated dataset. Our results demonstrate that deep learning models can accurately distinguish spoofed signals from genuine ones, achieving high detection performance. The best results are achieved by Transformer‑inspired architectures with early fusion of the inputs resulting in an error rate of 0.16%.

Abstract:
Presentation attacks represent a critical security threat where adversaries use fake biometric data, such as face, fingerprint, or iris images, to gain unauthorized access to protected systems. Various presentation attack detection (PAD) systems have been designed leveraging deep learning (DL) models to mitigate this type of threat. Despite their effectiveness, most of the DL models function as black boxes ‑ their decisions are opaque to their users. The purpose of explainability techniques is to provide detailed information about the reason behind the behavior or decision of DL models. In particular, visual explanation is necessary to better understand the decisions or predictions of DL‑based PAD systems and determine the key regions due to which a biometric image is considered real or fake by the system. In this work, a novel technique, Ensemble‑CAM, is proposed for providing visual explanations for the decisions made by deep learning‑based face PAD systems. Our goal is to improve DL‑based face PAD systems by providing a better understanding of their behavior. Our provided visual explanations will enhance the transparency and trustworthiness of DL‑based face PAD systems.

Abstract:
The growing prevalence of speech deepfakes has raised serious concerns, particularly in real‑world scenarios such as telephone fraud and identity theft. While many anti‑spoofing systems have demonstrated promising performance on lab‑generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low‑cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting‑edge zero‑shot text‑to‑speech (TTS) speech and physical replay recordings collected under varied devices and real‑world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real‑world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.

Abstract:
GPS spoofing poses a growing threat to aviation by falsifying satellite signals and misleading aircraft navigation systems. This paper demonstrates a proof‑of‑concept spoofing detection strategy based on analyzing satellite Carrier‑to‑Noise Density Ratio (C/N_0) variation during controlled static antenna orientations. Using a u‑blox EVK‑M8U receiver and a GPSG‑1000 satellite simulator, C/N_0 data is collected under three antenna orientations (flat, banked right, and banked left) in both real‑sky (non‑spoofed) and spoofed environments. Our findings reveal that under non‑spoofed signals, C/N_0 values fluctuate naturally with orientation, reflecting true geometric dependencies. However, spoofed signals demonstrate a distinct pattern: the flat orientation, which directly faces the spoofing antenna, consistently yielded the highest C/N_0 values, while both banked orientations showed reduced C/N_0 due to misalignment with the spoofing source. These findings suggest that simple maneuvers such as brief banking to induce C/N_0 variations can provide early cues of GPS spoofing for general aviation and UAV systems.
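
The reported pattern (flat orientation highest for every satellite under spoofing, mixed behaviour under real sky) suggests a simple consistency check. The Python sketch below is only an illustration of that idea, not the authors' procedure. The input layout, the function name, and both thresholds (`min_fraction`, `min_drop_db`) are assumptions.

```python
def spoofing_suspected(cn0_by_sat, min_fraction=0.9, min_drop_db=1.0):
    """Flag spoofing when nearly all satellites peak in the flat orientation.

    cn0_by_sat maps a satellite ID to a dict with mean C/N_0 values (dB-Hz)
    under the keys 'flat', 'right', and 'left'. Both thresholds are
    hypothetical and would need calibration on real recordings.
    """
    if not cn0_by_sat:
        raise ValueError("need C/N_0 readings for at least one satellite")
    consistent = 0
    for readings in cn0_by_sat.values():
        flat = readings["flat"]
        # Count satellites whose C/N_0 drops in BOTH banked orientations.
        if (flat - readings["right"] >= min_drop_db
                and flat - readings["left"] >= min_drop_db):
            consistent += 1
    return consistent / len(cn0_by_sat) >= min_fraction
```

Under authentic signals, satellites sit at different azimuths and elevations, so banking should raise C/N_0 for some and lower it for others; a uniform drop across all satellites is what the check looks for.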

Abstract:
An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi‑domain Image Translative Diffusion StyleGAN (MID‑StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID‑StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi‑domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID‑StyleGAN outperforms existing methods in generating high‑quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.

Abstract:
Spoofing detection in financial trading is crucial, especially for identifying complex behaviors such as conspiracy spoofing. Traditional machine‑learning approaches primarily focus on isolated node features, often overlooking the broader context of interconnected nodes. Graph‑based techniques, particularly Graph Neural Networks (GNNs), have advanced the field by leveraging relational information effectively. However, in real‑world spoofing detection datasets, trading behaviors exhibit dynamic, irregular patterns. Existing spoofing detection methods, though effective in some scenarios, struggle to capture the complexity of dynamic and diverse, evolving inter‑node relationships. To address these challenges, we propose a novel framework called the Generative Dynamic Graph Model (GDGM), which models dynamic trading behaviors and the relationships among nodes to learn representations for conspiracy spoofing detection. Specifically, our approach incorporates the generative dynamic latent space to capture the temporal patterns and evolving market conditions. Raw trading data is first converted into time‑stamped sequences. Then we model trading behaviors using the neural ordinary differential equations and gated recurrent units, to generate the representation incorporating temporal dynamics of spoofing patterns. Furthermore, pseudo‑label generation and heterogeneous aggregation techniques are employed to gather relevant information and enhance the detection performance for conspiratorial spoofing behaviors. Experiments conducted on spoofing detection datasets demonstrate that our approach outperforms state‑of‑the‑art models in detection accuracy. Additionally, our spoofing detection system has been successfully deployed in one of the largest global trading markets, further validating the practical applicability and performance of the proposed method.

Abstract:
Recent multi‑modal face anti‑spoofing (FAS) methods have investigated the potential of leveraging multiple modalities to distinguish live and spoof faces. However, pre‑adapted multi‑modal FAS models often fail to detect unseen attacks from new target domains. Although a more realistic domain adaptation (DA) scenario has been proposed for single‑modal FAS to learn specific spoof attacks during inference, DA remains unexplored in multi‑modal FAS methods. In this paper, we propose a novel framework, MFAS‑DANet, to address three major challenges in multi‑modal FAS under the DA scenario: missing modalities, noisy pseudo labels, and model degradation. First, to tackle the issue of missing modalities, we propose extracting complementary features from other modalities to substitute missing modality features or enhance existing ones. Next, to reduce the impact of noisy pseudo labels during model adaptation, we propose deriving reliable pseudo labels by leveraging prediction uncertainty across different modalities. Finally, to prevent model degradation, we design an adaptive mechanism that decreases the loss weight during unstable adaptations and increases it during stable ones. Extensive experiments demonstrate the effectiveness and state‑of‑the‑art performance of our proposed MFAS‑DANet.

Abstract:
With the rise of generative text‑to‑speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi‑dialect Arabic spoofed speech dataset. To evaluate the difficulty of the synthesized audio from each model, determine which produces the most challenging samples, and guide the construction of our final dataset (either by merging audio from multiple models or by selecting the best‑performing model), we built an evaluation pipeline that trains classifiers using modern embedding‑based methods combined with classifier heads, classical machine learning algorithms applied to MFCC features, and the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.

Abstract:
Current anti‑spoofing systems remain vulnerable to expressive and emotional synthetic speech, since they rarely leverage prosody as a discriminative cue. Prosody is central to human expressiveness and emotion, and humans instinctively use prosodic cues such as F0 patterns and voiced/unvoiced structure to distinguish natural from synthetic speech. In this paper, we propose HuLA, a two‑stage prosody‑aware multi‑task learning framework for spoof detection. In Stage 1, a self‑supervised learning (SSL) backbone is trained on real speech with auxiliary tasks of F0 prediction and voiced/unvoiced classification, enhancing its ability to capture natural prosodic variation similar to human perceptual learning. In Stage 2, the model is jointly optimized for spoof detection and prosody tasks on both real and synthetic data, leveraging prosodic awareness to detect mismatches between natural and expressive synthetic speech. Experiments show that HuLA consistently outperforms strong baselines on challenging out‑of‑domain datasets, including expressive, emotional, and cross‑lingual attacks. These results demonstrate that explicit prosodic supervision, combined with SSL embeddings, substantially improves robustness against advanced synthetic speech attacks.

Abstract:
Cryptographic Ranging Authentication is here! We present initial results on the Pulsar authenticated ranging service broadcast from space with Pulsar‑0 utilizing a recording taken at Xona headquarters in Burlingame, CA. No assumptions pertaining to the ownership or leakage of encryption keys are required. This work discusses the Pulsar watermark design and security analysis. We derive the Pulsar watermark's probabilities of missed detection and false alarm, and we discuss the receiver processing needed to utilize the Pulsar watermark. We present validation results of the Pulsar watermark utilizing the transmissions from orbit. Lastly, we provide results that demonstrate the spoofing detection efficacy with a spoofing scenario that incorporates the authentic transmissions from orbit. Because we make no assumption about the leakage of symmetric encryption keys, provide mathematical justification of the watermark's security, and demonstrate it with our July 2025 transmissions from orbit, we claim the world's first authenticated satellite pseudorange from orbit.

Abstract:
In this paper, we present the XMUspeech systems submitted to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in the ASVspoof 5 database has significantly increased, and we observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the performance of AASIST, HM‑Conformer, HuBERT, and Wav2vec2 with various input features and loss functions. Specifically, in order to obtain artifact‑related information, we trained self‑supervised models on the dataset containing spoofing utterances as the feature extractors. We also applied an adaptive multi‑scale feature fusion (AMFF) method to integrate features from multiple Transformer layers with the hand‑crafted feature to enhance the detection capability. In addition, we conducted extensive experiments on one‑class loss functions and provided optimized configurations to better align with the anti‑spoofing task. Our fusion system achieved a minDCF of 0.4783 and an EER of 20.45% in the closed condition, and a minDCF of 0.2245 and an EER of 9.36% in the open condition.
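
For readers unfamiliar with the equal error rate (EER) quoted above, the following Python sketch shows one simple way to estimate it from detector scores. It assumes higher scores mean "more likely bona fide" and picks the threshold where the false acceptance and false rejection rates are closest. It is a generic illustration, not the official ASVspoof scoring code, and the function name is hypothetical.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # False acceptance rate: spoofed trials that pass the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide trials that fall below it.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

With few trials the two rates rarely cross exactly, so the sketch reports their average at the closest point; production toolkits typically interpolate along the ROC curve instead.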

Abstract:
In this paper, we introduce the discrete optimal transport voice conversion (kDOT‑VC) method. Comparison with kNN‑VC, SinkVC, and Gaussian optimal transport (MKL) demonstrates stronger domain adaptation abilities of our method. We use the probabilistic nature of optimal transport (OT) and show that kDOT‑VC is an effective black‑box adversarial attack against modern audio anti‑spoofing countermeasures (CMs). Our attack operates as a post‑processing, distribution‑alignment step: frame‑level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top‑k barycentric projection, then decoded with a neural vocoder. Ablation analysis indicates that distribution‑level alignment is a powerful and stable attack for deployed CMs.
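
The alignment step described above (entropic OT followed by a top‑k barycentric projection) can be sketched in plain Python on toy vectors. This is a generic illustration under stated assumptions, not the authors' implementation: uniform marginals, a squared Euclidean cost, a fixed iteration count, and all function names are choices made here for clarity. Real frame‑level embeddings would call for a vectorised, numerically stabilised solver.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

def sinkhorn_plan(cost, eps=1.0, n_iter=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations.

    Assumes eps is large enough that exp(-cost / eps) does not underflow
    to zero for an entire row or column.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform source marginal (assumption)
    b = [1.0 / m] * m  # uniform target marginal (assumption)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def topk_barycentric_projection(sources, targets, k=2, eps=1.0):
    """Map each source vector to a weighted mean of its top-k targets."""
    cost = [[sq_dist(x, y) for y in targets] for x in sources]
    plan = sinkhorn_plan(cost, eps=eps)
    dim = len(targets[0])
    projected = []
    for row in plan:
        # Keep the k targets that receive the most transport mass.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        mass = sum(row[j] for j in top)
        projected.append(
            [sum(row[j] * targets[j][d] for j in top) / mass for d in range(dim)]
        )
    return projected
```

Here `sources` would play the role of generated-speech frames and `targets` the bona fide pool; the projected vectors are what a vocoder would then decode.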

Abstract:
Foundation models such as CLIP have demonstrated exceptional zero‑ and few‑shot transfer capabilities across diverse vision tasks. However, when fine‑tuned for highly specialized biometric tasks such as face recognition (FR), morphing attack detection (MAD), and presentation attack detection (PAD), these models may suffer from over‑specialization. Thus, they may lose one of their foundational strengths, cross‑domain generalization. In this work, we systematically quantify these trade‑offs by evaluating three instances of CLIP fine‑tuned for FR, MAD, and PAD. We evaluate each adapted model as well as the original CLIP baseline on 14 general vision datasets under zero‑shot and linear‑probe protocols, alongside common FR, MAD, and PAD benchmarks. Our results indicate that fine‑tuned models suffer from over‑specialization, especially when fine‑tuned for the complex task of FR. Our results also indicate that task complexity and classification head design, multi‑class (FR) vs. binary (MAD and PAD), correlate with the degree of catastrophic forgetting. The FRoundation model with the ViT‑L backbone outperforms other approaches on the large‑scale FR benchmark IJB‑C, achieving an improvement of up to 58.52%. However, it experiences a substantial performance drop on ImageNetV2, reaching only 51.63% compared to 69.84% achieved by the baseline CLIP model. Moreover, the larger CLIP architecture consistently preserves more of the model's original generalization ability than the smaller variant, indicating that increased model capacity may help mitigate over‑specialization.

Abstract:
Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high‑security sectors such as banking and healthcare. Despite improvements from deep learning, they face severe vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti‑spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI‑generated audio. By altering the voice in zones imperceptible to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state‑of‑the‑art (SOTA) models across multiple tasks, under simulated real‑world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next‑generation defenses that employ dynamic, context‑aware frameworks capable of evolving with the threat landscape.

Abstract:
Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural‑sounding synthetic speech that closely mimics human voices. While these innovations hold promise for applications like assistive technologies, they also pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited by poor generalization to unseen deepfake attacks and languages. To address this, we propose a gating mechanism that extracts relevant features from the XLS‑R speech foundation model, which serves as a front‑end feature extractor. For the downstream back‑end classifier, we employ Multi‑kernel gated Convolution (MultiConv) to capture both local and global speech artifacts. Additionally, we introduce Centered Kernel Alignment (CKA) as a similarity metric to enforce diversity in learned features across different MultiConv layers. By integrating CKA with our gating mechanism, we hypothesize that each component helps improve the learning of distinct synthetic speech patterns. Experimental results demonstrate that our approach achieves state‑of‑the‑art performance on in‑domain benchmarks while generalizing robustly to out‑of‑domain datasets, including multilingual speech samples. This underscores its potential as a versatile solution for detecting evolving speech deepfake threats.
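
To make the CKA similarity concrete, here is a minimal Python sketch of linear CKA between two feature matrices (rows are samples, columns are feature dimensions). The abstract does not say which CKA variant or kernel is used, so the linear form is an assumption, and the function names are hypothetical.

```python
import math

def center_columns(X):
    """Subtract each column's mean so features are zero-centered."""
    n = len(X)
    means = [sum(row[d] for row in X) / n for d in range(len(X[0]))]
    return [[row[d] - means[d] for d in range(len(row))] for row in X]

def gram(X):
    """Linear-kernel Gram matrix X X^T over samples."""
    return [[sum(p * q for p, q in zip(xi, xj)) for xj in X] for xi in X]

def linear_cka(X, Y):
    """Linear CKA: <K, L>_F / (||K||_F * ||L||_F) on centered features.

    X and Y must have the same number of rows (samples) and non-constant
    features; values near 1 mean similar representations, near 0 dissimilar.
    """
    K = gram(center_columns(X))
    L = gram(center_columns(Y))
    hsic = sum(k * l for rk, rl in zip(K, L) for k, l in zip(rk, rl))
    norm_k = math.sqrt(sum(k * k for row in K for k in row))
    norm_l = math.sqrt(sum(l * l for row in L for l in row))
    return hsic / (norm_k * norm_l)
```

A diversity term could then penalise high CKA between the outputs of different layers, although how the paper weights or combines such a term is not stated in the abstract.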

Abstract:
Face recognition systems are robust against environmental changes and noise, and thus may be vulnerable to illegal authentication attempts using user face photos, such as spoofing attacks. To prevent such spoofing attacks, it is crucial to discriminate whether the input image is a live user image or a spoofed image prior to the face recognition process. Most existing spoofing attack detection methods utilize deep learning, which necessitates a substantial amount of training data. Consequently, if malicious data is injected into a portion of the training dataset, a specific spoofing attack may be erroneously classified as live, leading to false positives. In this paper, we propose a novel backdoor poisoning attack method to demonstrate the latent threat of backdoor poisoning within face anti‑spoofing detection. The proposed method enables certain spoofing attacks to bypass detection by embedding features extracted from the spoofing attack's face image into a live face image without inducing any perceptible visual alterations. Through experiments conducted on public datasets, we demonstrate that the proposed method constitutes a realistic threat to existing spoofing attack detection systems.

Abstract:
Rapid advancements in generative modeling have made synthetic audio generation easy, making speech‑based services vulnerable to spoofing attacks. Consequently, there is a dire need for robust countermeasures more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability and fail drastically when applied to real‑world data. This study proposes a novel method for generalizable spoofing detection leveraging non‑semantic universal audio representations. Extensive experiments have been performed to find suitable non‑semantic features using TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in‑domain test set while significantly outperforming state‑of‑the‑art approaches on out‑of‑domain test sets. Notably, it demonstrates superior generalization on public‑domain data, surpassing methods based on hand‑crafted features, semantic embeddings, and end‑to‑end architectures.

Abstract:
ICAO‑compliant facial images, initially designed for secure biometric passports, are increasingly becoming central to identity verification in a wide range of application contexts, including border control, digital travel credentials, and financial services. While their standardization enables global interoperability, it also facilitates practices such as morphing and deepfakes, which can be exploited for harmful purposes like identity theft and illegal sharing of identity documents. Traditional countermeasures like Presentation Attack Detection (PAD) are limited to real‑time capture and offer no post‑capture protection. This survey paper investigates digital watermarking and steganography as complementary solutions that embed tamper‑evident signals directly into the image, enabling persistent verification without compromising ICAO compliance. We provide the first comprehensive analysis of state‑of‑the‑art techniques to evaluate the potential and drawbacks of the underlying approaches concerning the applications involving ICAO‑compliant images and their suitability under standard constraints. We highlight key trade‑offs, offering guidance for secure deployment in real‑world identity systems.

Abstract:
Global Navigation Satellite Systems (GNSS) are critical for Positioning, Navigation, and Timing (PNT) applications. However, GNSS are highly vulnerable to spoofing attacks, where adversaries transmit counterfeit signals to mislead receivers. Such attacks can lead to severe consequences, including misdirected navigation, compromised data integrity, and operational disruptions. Most existing spoofing detection methods depend on supervised learning techniques and struggle to detect novel, evolved, and unseen attacks. To overcome this limitation, we develop a zero‑day spoofing detection method using a Hybrid Quantum‑Classical Autoencoder (HQC‑AE), trained solely on authentic GNSS signals without exposure to spoofed data. By leveraging features extracted during the tracking stage, our method enables proactive detection before PNT solutions are computed. We focus on spoofing detection in static GNSS receivers, which are particularly susceptible to time‑push spoofing attacks, where attackers manipulate timing information to induce incorrect time computations at the receiver. We evaluate our model against different unseen time‑push spoofing attack scenarios: simplistic, intermediate, and sophisticated. Our analysis demonstrates that the HQC‑AE consistently outperforms its classical counterpart, traditional supervised learning‑based models, and existing unsupervised learning‑based methods in detecting zero‑day, unseen GNSS time‑push spoofing attacks, achieving an average detection accuracy of 97.71% with an average false negative rate of 0.62% (when an attack occurs but is not detected). For sophisticated spoofing attacks, the HQC‑AE attains an accuracy of 98.23% with a false negative rate of 1.85%. These findings highlight the effectiveness of our method in proactively detecting zero‑day GNSS time‑push spoofing attacks across various stationary GNSS receiver platforms.

Abstract:
The WildSpoof Challenge aims to advance the use of in‑the‑wild data in two intertwined speech processing tasks. It consists of two parallel tracks: (1) Text‑to‑Speech (TTS) synthesis for generating spoofed speech, and (2) Spoofing‑robust Automatic Speaker Verification (SASV) for detecting spoofed speech. While the organizers coordinate both tracks and define the data protocols, participants treat them as separate and independent tasks. The primary objectives of the challenge are: (i) to promote the use of in‑the‑wild data for both TTS and SASV, moving beyond conventional clean and controlled datasets and considering real‑world scenarios; and (ii) to encourage interdisciplinary collaboration between the spoofing generation (TTS) and spoofing detection (SASV) communities, thereby fostering the development of more integrated, robust, and realistic systems.

Abstract:
Voice authentication has undergone significant changes from traditional systems that relied on handcrafted acoustic features to deep learning models that can extract robust speaker embeddings. This advancement has expanded its applications across finance, smart devices, law enforcement, and beyond. However, as adoption has grown, so have the threats. This survey presents a comprehensive review of the modern threat landscape targeting Voice Authentication Systems (VAS) and Anti‑Spoofing Countermeasures (CMs), including data poisoning, adversarial, deepfake, and adversarial spoofing attacks. We chronologically trace the development of voice authentication and examine how vulnerabilities have evolved in tandem with technological advancements. For each category of attack, we summarize methodologies, highlight commonly used datasets, compare performance and limitations, and organize existing literature using widely accepted taxonomies. By highlighting emerging risks and open challenges, this survey aims to support the development of more secure and resilient voice authentication systems.

Abstract:
Nowadays, the development of a Presentation Attack Detection (PAD) system for ID cards presents a challenge due to the lack of images available to train a robust PAD system and the increase in diversity of possible attack instrument species. Today, most algorithms focus on generating attack samples and do not take into account the limited number of bona fide images. This work is one of the first to propose a method for mimicking bona fide images by generating synthetic versions of them using Stable Diffusion, which may help improve the generalisation capabilities of the detector. Furthermore, the new images generated are evaluated in a system trained from scratch and in a commercial solution. The PAD system yields an interesting result, as it identifies our images as bona fide, which has a positive impact on detection performance and data restrictions.

Abstract:
In the rapidly evolving landscape of digital security, biometric authentication systems, particularly facial recognition, have emerged as integral components of various security protocols. However, the reliability of these systems is compromised by sophisticated spoofing attacks, where imposters gain unauthorized access by falsifying biometric traits. Current literature reveals a concerning gap: existing liveness detection methodologies ‑ designed to counteract these breaches ‑ fall short against advanced spoofing tactics employing deepfakes and other artificial intelligence‑driven manipulations. This study introduces a robust solution through novel deep learning models addressing the deficiencies in contemporary anti‑spoofing techniques. By innovatively integrating texture analysis and reflective properties associated with genuine human traits, our models distinguish authentic presence from replicas with remarkable precision. Extensive evaluations were conducted across five diverse datasets, encompassing a wide range of attack vectors and environmental conditions. Results demonstrate substantial advancement over existing systems, with our best model (AttackNet V2.2) achieving 99.9% average accuracy when trained on combined data. Moreover, our research unveils critical insights into the behavioral patterns of impostor attacks, contributing to a more nuanced understanding of their evolving nature. The implications are profound: our models do not merely fortify the authentication processes but also instill confidence in biometric systems across various sectors reliant on secure access.

Abstract:
Can we teach machines to assess the expertise of humans solving visual tasks automatically based on eye tracking features? This paper proposes AutoSIGHT, Automatic System for Immediate Grading of Human experTise, that classifies expert and non‑expert performers, and builds upon an ensemble of features extracted from eye tracking data while the performers were solving a visual task. Results on the task of iris Presentation Attack Detection (PAD) used for this study show that with a small evaluation window of just 5 seconds, AutoSIGHT achieves an average Area Under the ROC curve (AUROC) of 0.751 in a subject‑disjoint train‑test regime, indicating that such detection is viable. Furthermore, when a larger evaluation window of up to 30 seconds is available, the AUROC increases to 0.8306, indicating the model is effectively leveraging more information at a cost of slightly delayed decisions. This work opens new areas of research on how to incorporate the automatic weighing of human and machine expertise into human‑AI pairing setups, which need to react dynamically to nonstationary expertise distribution between the human and AI players (e.g. when the experts need to be replaced, or the task at hand changes rapidly). Along with this paper, we offer the eye tracking data used in this study, collected from 6 experts and 53 non‑experts solving the iris PAD visual task.

Abstract:
Address Resolution Protocol (ARP) spoofing remains a critical threat to IoT networks, enabling attackers to intercept, modify, or disrupt data transmission by exploiting ARP's lack of authentication. The decentralized and resource‑constrained nature of IoT environments amplifies this vulnerability, making conventional detection mechanisms ineffective at scale. This paper introduces an intelligent, multi‑layered machine learning framework designed to detect ARP spoofing in real‑time IoT deployments. Our approach combines feature engineering based on ARP header behavior, traffic flow analysis, and temporal packet anomalies with a hybrid detection pipeline incorporating decision trees, ensemble models, and deep learning classifiers. We propose a hierarchical architecture to prioritize lightweight models at edge gateways and deeper models at centralized nodes to balance detection accuracy and computational efficiency. The system is validated on both simulated IoT traffic and the CICIDS2017 dataset, achieving over 97% detection accuracy with low false positive rates. Comparative evaluations with signature‑based and rule‑based systems demonstrate the robustness and generalizability of our approach. Our results show that intelligent machine learning integration enables proactive ARP spoofing detection tailored for IoT scenarios, laying the groundwork for scalable and autonomous network security solutions.

Authors: Juan E. Tapia, Mario Nieto, Juan M. Espin, Alvaro S. Rocamora, Javier Barrachina, Naser Damer, Christoph Busch, Marija Ivanovska, Leon Todorov, Renat Khizbullin, Lazar Lazarevich, Aleksei Grishin, Daniel Schulz, Sebastian Gonzalez, Amir Mohammadi, Ketan Kotwal, Sebastien Marcel, Raghavendra Mudgalgundurao, Kiran Raja, Patrick Schuch, Sushrut Patwardhan, Raghavendra Ramachandra, Pedro Couto Pereira, Joao Ribeiro Pinto, Mariana Xavier, Andrés Valenzuela, Rodrigo Lara, Borut Batagelj, Marko Peterlin, Peter Peer, Ajnas Muhammed, Diogo Nunes, Nuno Gonçalves

Abstract:
This work summarises and reports the results of the second Presentation Attack Detection competition on ID cards. This new version includes new elements compared to the previous one. (1) An automatic evaluation platform was enabled for automatic benchmarking; (2) Two tracks were proposed in order to evaluate algorithms and datasets, respectively; and (3) A new ID card dataset was shared with Track 1 teams to serve as the baseline dataset for the training and optimisation. The Hochschule Darmstadt, Fraunhofer‑IGD, and Facephi company jointly organised this challenge. 20 teams were registered, and 74 submitted models were evaluated. For Track 1, the "Dragons" team reached first place with an Average Ranking (AV‑Rank) of 40.48% and an Equal Error Rate (EER) of 11.44%. For the more challenging approach in Track 2, the "Incode" team reached the best results with an AV‑Rank of 14.76% and 6.36% EER, improving on the results of the first edition of 74.30% and 21.87% EER, respectively. These results suggest that PAD on ID cards is improving, but it is still a challenging problem related to the number of images, especially of bona fide images.

Abstract:
Existing saliency‑guided training approaches improve model generalization by incorporating a loss term that compares the model's class activation map (CAM) for a sample's true‑class (i.e., correct‑label class) against a human reference saliency map. However, prior work has ignored the false‑class CAM(s), that is the model's saliency obtained for incorrect‑label class. We hypothesize that in binary tasks the true and false CAMs should diverge on the important classification features identified by humans (and reflected in human saliency maps). We use this hypothesis to motivate three new saliency‑guided training methods incorporating both true‑ and false‑class model's CAM into the training strategy and a novel post‑hoc tool for identifying important features. We evaluate all introduced methods on several diverse binary close‑set and open‑set classification tasks, including synthetic face detection, biometric presentation attack detection, and classification of anomalies in chest X‑ray scans, and find that the proposed methods improve generalization capabilities of deep learning models over traditional (true‑class CAM only) saliency‑guided training approaches. We offer source codes and model weights (GitHub repository link removed to preserve anonymity) to support reproducible research.

Abstract:
Although face recognition systems have undergone an impressive evolution in the last decade, these technologies are vulnerable to attack presentations (AP). These attacks are mostly easy to create and, by executing them against the system's capture device, the malicious actor can impersonate an authorised subject and thus gain access to the latter's information (e.g., financial transactions). To protect facial recognition schemes against presentation attacks, state‑of‑the‑art deep learning presentation attack detection (PAD) approaches require a large amount of data to produce reliable detection performances and even then, they decrease their performance for unknown presentation attack instruments (PAI) or database (information not seen during training), i.e. they lack generalisability. To mitigate the above problems, this paper focuses on zero‑shot PAD. To do so, we first assess the effectiveness and generalisability of foundation models in established and challenging experimental scenarios and then propose a simple but effective framework for zero‑shot PAD. Experimental results show that these models are able to achieve performance in difficult scenarios with minimal effort of the more advanced PAD mechanisms, whose weights were optimised mainly with training sets that included APs and bona fide presentations. The top‑performing foundation model outperforms by a margin the best from the state of the art observed with the leaving‑one‑out protocol on the SiW‑Mv2 database, which contains challenging unknown 2D and 3D attacks.

Abstract:
Autonomous unmanned aerial vehicles (UAVs) rely on global navigation satellite system (GNSS) pseudorange measurements for accurate real‑time localization and navigation. However, this dependence exposes them to sophisticated spoofing threats, where adversaries manipulate pseudoranges to deceive UAV receivers. Among these, drift‑evasive spoofing attacks subtly perturb measurements, gradually diverting the UAV's trajectory without triggering conventional signal‑level anti‑spoofing mechanisms. Traditional distributional shift detection techniques often require accumulating a threshold number of samples, causing delays that impede rapid detection and timely response. Consequently, robust temporal‑scale detection methods are essential to identify attack onset and enable contingency planning with alternative sensing modalities, improving resilience against stealthy adversarial manipulations. This study explores a Bayesian online change point detection (BOCPD) approach that monitors temporal shifts in value estimates from a reinforcement learning (RL) critic network to detect subtle behavioural deviations in UAV navigation. Experimental results show that this temporal value‑based framework outperforms conventional GNSS spoofing detectors, temporal semi‑supervised learning frameworks, and the Page‑Hinkley test, achieving higher detection accuracy and lower false‑positive and false‑negative rates for drift‑evasive spoofing attacks.

Abstract:
Automatic speaker verification (ASV) systems are often affected by spoofing attacks. Recent transformer‑based models have improved anti‑spoofing performance by learning strong feature representations. However, these models usually need high computing power. To address this, we introduce RawTFNet, a lightweight CNN model designed for audio signals. The RawTFNet separates feature processing along time and frequency dimensions, which helps to capture the fine‑grained details of synthetic speech. We tested RawTFNet on the ASVspoof 2021 LA and DF evaluation datasets. The results show that RawTFNet reaches comparable performance to that of the state‑of‑the‑art models, while also using fewer computing resources. The code and models will be made publicly available.

Abstract:
Multi‑modal face anti‑spoofing (FAS) aims to detect genuine human presence by extracting discriminative liveness cues from multiple modalities, such as RGB, infrared (IR), and depth images, to enhance the robustness of biometric authentication systems. However, because data from different modalities are typically captured by various camera sensors and under diverse environmental conditions, multi‑modal FAS often exhibits significantly greater distribution discrepancies across training and testing domains compared to single‑modal FAS. Furthermore, during the inference stage, multi‑modal FAS confronts even greater challenges when one or more modalities are unavailable or inaccessible. In this paper, we propose a novel Cross‑modal Transition‑guided Network (CTNet) to tackle the challenges in the multi‑modal FAS task. Our motivation stems from the observation that, within a single modality, the visual differences between live faces are typically much smaller than those of spoof faces. Additionally, feature transitions across modalities are more consistent for the live class compared to those between live and spoof classes. Building on this insight, we first propose learning consistent cross‑modal feature transitions among live samples to construct a generalized feature space. Next, we introduce learning the inconsistent cross‑modal feature transitions between live and spoof samples to effectively detect out‑of‑distribution (OOD) attacks during inference. To further address the issue of missing modalities, we propose learning complementary infrared (IR) and depth features from the RGB modality as auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet outperforms previous two‑class multi‑modal FAS methods across most protocols.

Abstract:
Partial audio deepfake localization poses unique challenges and remains underexplored compared to full‑utterance spoofing detection. While recent methods report strong in‑domain performance, their real‑world utility remains unclear. In this analysis, we critically examine the limitations of current evaluation practices, particularly the widespread use of Equal Error Rate (EER), which often obscures generalization and deployment readiness. We propose reframing the localization task as a sequential anomaly detection problem and advocate for the use of threshold‑dependent metrics such as accuracy, precision, recall, and F1‑score, which better reflect real‑world behavior. Specifically, we analyze the performance of the open‑source Coarse‑to‑Fine Proposal Refinement Framework (CFPRF), which achieves a 20‑ms EER of 7.61% on the in‑domain PartialSpoof evaluation set, but 43.25% and 27.59% on the LlamaPartialSpoof and Half‑Truth out‑of‑domain test sets. Interestingly, our reproduced version of the same model performs worse on in‑domain data (9.84%) but better on the out‑of‑domain sets (41.72% and 14.98%, respectively). This highlights the risks of over‑optimizing for in‑domain EER, which can lead to models that perform poorly in real‑world scenarios. It also suggests that while deep learning models can be effective on in‑domain data, they generalize poorly to out‑of‑domain scenarios, failing to detect novel synthetic samples and misclassifying unfamiliar bona fide audio. Finally, we observe that adding more bona fide or fully synthetic utterances to the training data often degrades performance, whereas adding partially fake utterances improves it.
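
To make the contrast between EER and threshold‑dependent metrics concrete, here is a minimal sketch on toy scores. It is not the evaluation code of the paper: the score convention (higher means more bona fide), the fixed threshold of 0.5, and all names are assumptions for illustration.

```python
import numpy as np


def equal_error_rate(bona_scores, spoof_scores):
    """Return (eer, threshold); assumes higher score = more bona fide."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    # False rejection rate: bona fide segments scored below the threshold.
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    # False acceptance rate: spoofed segments scored at or above it.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])


def threshold_metrics(bona_scores, spoof_scores, threshold):
    """Accuracy, precision, recall, F1 with spoof as the positive class."""
    tp = int((spoof_scores < threshold).sum())   # spoof flagged as spoof
    fn = int((spoof_scores >= threshold).sum())  # spoof missed
    fp = int((bona_scores < threshold).sum())    # bona fide flagged as spoof
    tn = int((bona_scores >= threshold).sum())   # bona fide accepted
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return accuracy, precision, recall, f1


# Toy per-segment scores (hypothetical, not from any dataset).
bona = np.array([0.9, 0.8, 0.7, 0.4])
spoof = np.array([0.6, 0.3, 0.2, 0.1])

eer, eer_threshold = equal_error_rate(bona, spoof)
acc, prec, rec, f1 = threshold_metrics(bona, spoof, threshold=0.5)
```

The EER picks its operating threshold from the evaluation scores themselves, whereas the second function reports what happens at a threshold fixed in advance, which is closer to how a deployed detector behaves on unseen data.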

Abstract:
Recently the emergence of novel presentation attacks has drawn increasing attention to face anti‑spoofing. However, existing methods tend to memorize data patterns from the training set, resulting in poor generalization to unknown attack types across different scenarios and limited interpretability. To address these challenges, this paper presents a reinforcement fine‑tuning‑based face anti‑spoofing method that stimulates the capabilities of multimodal large language models to think and learn how to solve the anti‑spoofing task itself, rather than relying on the memorization of authenticity patterns. We design verifiable class consistent reward and reasoning consistent reward, and employ a GRPO‑based optimization strategy to guide the model in exploring reasoning policies from multiple perspectives to maximize expected rewards. As a result, through iterative trial‑and‑error learning while retaining only high‑reward trajectories, the model distills highly generalizable decision‑making rules from the extensive solution space to effectively address cross‑domain face anti‑spoofing tasks. Extensive experimental results demonstrate that our method achieves state‑of‑the‑art cross‑domain generalization performance. It generalizes well to diverse unknown attack types in unseen target domains while providing interpretable reasoning for its authenticity decisions without requiring labor‑intensive textual annotations for training.

Abstract:
Healthcare 5.0 integrates Artificial Intelligence (AI), the Internet of Things (IoT), real‑time monitoring, and human‑centered design toward personalized medicine and predictive diagnostics. However, the increasing reliance on interconnected medical technologies exposes them to cyber threats. Meanwhile, current AI‑driven cybersecurity models often neglect biomedical data, limiting their effectiveness and interpretability. This study addresses this gap by applying eXplainable AI (XAI) to a Healthcare 5.0 dataset that integrates network traffic and biomedical sensor data. Classification outputs indicate that XGBoost achieved 99% F1‑score for benign and data alteration, and 81% for spoofing. Explainability findings reveal that network data play a dominant role in intrusion detection whereas biomedical features contribute to spoofing detection, with temperature reaching a Shapley value magnitude of 0.37.

Abstract:
Tele‑operated robots rely on real‑time user behavior mapping for remote tasks, but ensuring secure authentication remains a challenge. Traditional methods, such as passwords and static biometrics, are vulnerable to spoofing and replay attacks, particularly in high‑stakes, continuous interactions. This paper presents a novel anti‑spoofing and anti‑replay authentication approach that leverages distinctive user behavioral features extracted from haptic feedback during human‑robot interactions. To evaluate our authentication approach, we collected a time‑series force feedback dataset from 15 participants performing seven distinct tasks. We then developed a transformer‑based deep learning model to extract temporal features from the haptic signals. By analyzing user‑specific force dynamics, our method achieves over 90 percent accuracy in both user identification and task classification, demonstrating its potential for enhancing access control and identity assurance in tele‑robotic systems.

Abstract:
The limited or no protection for civilian Global Navigation Satellite System (GNSS) signals makes spoofing attacks relatively easy. With modern mobile devices often featuring network interfaces, state‑of‑the‑art signals of opportunity (SOP) schemes can provide accurate network positions in replacement of GNSS. The use of onboard inertial sensors can also assist in the absence of GNSS, possibly in the presence of jammers. The combination of SOP and inertial sensors has received limited attention, yet it shows strong results on fully custom‑built platforms. We do not seek to improve such special‑purpose schemes. Rather, we focus on countering GNSS attacks, notably detecting them, with emphasis on deployment with consumer‑grade platforms, notably smartphones, that provide off‑the‑shelf opportunistic information (i.e., network position and inertial sensor data). Our Position‑based Attack Detection Scheme (PADS) is a probabilistic framework that uses regression and uncertainty analysis for positions. The regression optimization problem is a weighted mean square error of polynomial fitting, with constraints that the fitted positions satisfy the device velocity and acceleration. Then, uncertainty is modeled by a Gaussian process, which provides more flexibility to analyze how sure or unsure we are about position estimations. In the detection process, we combine all uncertainty information with the position estimations into a fused test statistic, which is the input utilized by an anomaly detector based on outlier ensembles. The evaluation shows that the PADS outperforms a set of baseline methods that rely on SOP or inertial sensor‑based or statistical tests, achieving up to 3 times the true positive rate at a low false positive rate.

Abstract:
Spoofed utterances always contain artifacts introduced by generative models. While several countermeasures have been proposed to detect spoofed utterances, most primarily focus on architectural improvements. In this work, we investigate how artifacts remain hidden in spoofed speech and how to enhance their presence. We propose a model‑agnostic pipeline that amplifies artifacts using speech enhancement and various types of noise. Our approach consists of three key steps: noise addition, noise extraction, and noise amplification. First, we introduce noise into the raw speech. Then, we apply speech enhancement to extract the entangled noise and artifacts. Finally, we amplify these extracted features. Moreover, our pipeline is compatible with different speech enhancement models and countermeasure architectures. Our method improves spoof detection performance by up to 44.44% on ASVspoof2019 and 26.34% on ASVspoof2021.

Abstract:
Quantization is essential for deploying large audio language models (LALMs) efficiently in resource‑constrained environments. However, its impact on complex tasks, such as zero‑shot audio spoofing detection, remains underexplored. This study evaluates the zero‑shot capabilities of five LALMs, GAMA, LTU‑AS, MERaLiON, Qwen‑Audio, and SALMONN, across three distinct datasets: ASVspoof2019, In‑the‑Wild, and WaveFake, and investigates their robustness to quantization (FP32, FP16, INT8). Despite high initial spoof detection accuracy, our analysis demonstrates severe predictive biases toward spoof classification across all models, rendering their practical performance equivalent to random classification. Interestingly, quantization to FP16 precision resulted in negligible performance degradation compared to FP32, effectively halving memory and computational requirements without materially impacting accuracy. However, INT8 quantization intensified model biases, significantly degrading balanced accuracy. These findings highlight critical architectural limitations and emphasize FP16 quantization as an optimal trade‑off, providing guidelines for practical deployment and future model refinement.
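
The gap between raw accuracy and balanced accuracy under a prediction bias, as described in this abstract, can be seen in a small sketch. The class proportions and the always‑spoof predictor below are hypothetical and are not taken from the study.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; 0.5 is chance level for two classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Hypothetical evaluation set: 90 spoofed clips (1) and 10 bona fide clips (0).
y_true = np.array([1] * 90 + [0] * 10)
# A fully biased model that answers "spoof" for every clip.
y_pred = np.ones_like(y_true)

plain_accuracy = float((y_true == y_pred).mean())  # looks strong on spoof-heavy data
bal_accuracy = balanced_accuracy(y_true, y_pred)   # reveals chance-level behavior
```

Here the plain accuracy is 0.9 while the balanced accuracy is 0.5, which is why a strong bias toward one class can coexist with a high headline number.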

Abstract:
Nowadays, one of the main challenges in presentation attack detection (PAD) on ID cards is obtaining generalisation capabilities for a diversity of countries that are issuing ID cards. Most PAD systems are trained on one, two, or three ID documents because of privacy protection concerns. As a result, they do not obtain competitive results for commercial purposes when tested in an unknown new ID card country. In this scenario, Foundation Models (FM) trained on huge datasets can help to improve generalisation capabilities. This work intends to improve and benchmark the capabilities of FM and how to use them to adapt the generalisation on PAD of ID Documents. Different test protocols were used, considering zero‑shot and fine‑tuning, on two different ID card datasets: one private dataset based on Chilean IDs and one open‑set dataset based on three ID countries (Finland, Spain, and Slovakia). Our findings indicate that bona fide images are the key to generalisation.

Abstract:
This paper addresses source tracing in synthetic speech, that is, identifying the generative systems behind manipulated audio, via speaker recognition‑inspired pipelines. While prior work focuses on spoofing detection, source tracing lacks robust solutions. We evaluate two approaches: classification‑based and metric‑learning. We tested our methods on the MLAADv5 benchmark using ResNet and self‑supervised learning (SSL) backbones. The results show that ResNet achieves competitive performance with the metric learning approach, matching and even exceeding SSL‑based systems. Our work demonstrates ResNet's viability for source tracing while underscoring the need to optimize SSL representations for this task. Our work bridges speaker recognition methodologies with audio forensic challenges, offering new directions for combating synthetic media manipulation.

Abstract:
Face Anti‑Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image‑text understanding and semantic reasoning, suggesting that integrating visual and linguistic co‑inference into FAS can substantially improve both robustness and interpretability. However, the lack of a high‑quality vision‑language multimodal dataset has been a critical bottleneck. To address this, we introduce FaceCoT (Face Chain‑of‑Thought), the first large‑scale Visual Question Answering (VQA) dataset tailored for FAS. FaceCoT covers 14 spoofing attack types and enriches model learning with high‑quality CoT VQA annotations. Meanwhile, we develop a caption model refined via reinforcement learning to expand the dataset and enhance annotation quality. Furthermore, we introduce a CoT‑Enhanced Progressive Learning (CEPL) strategy to better leverage the CoT data and boost model performance on FAS tasks. Extensive experiments demonstrate that models trained with FaceCoT and CEPL outperform state‑of‑the‑art methods on multiple benchmark datasets.

Abstract:
The advent of foundation models, particularly Vision‑Language Models (VLMs) and Multi‑modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero‑shot and few‑shot performance of state‑of‑the‑art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success. For example, in the case of face verification, a True Match Rate (TMR) of 96.77 percent was obtained at a False Match Rate (FMR) of 1 percent on the Labeled Faces in the Wild (LFW) dataset, without any fine‑tuning. In the case of iris recognition, the TMR at 1 percent FMR on the IITD‑R‑Full dataset was 97.55 percent without any fine‑tuning. Further, we show that applying a simple classifier head to these embeddings can help perform DeepFake detection for faces, Presentation Attack Detection (PAD) for irides, and extract soft biometric attributes like gender and ethnicity from faces with reasonably high accuracy. This work reiterates the potential of pretrained models in achieving the long‑term vision of Artificial General Intelligence.
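
The "TMR at 1 percent FMR" figures quoted above are operating points on a verification score distribution. The following is a minimal sketch of how such a number is commonly computed, not the benchmark's own evaluation code. It assumes that higher similarity scores mean "same identity" and that a pair is accepted only when its score is strictly above the threshold; the function name `tmr_at_fmr` is hypothetical.

```python
import numpy as np

def tmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.01):
    """Return (TMR, threshold) at the strictest threshold with FMR <= target_fmr.

    Assumption: higher score = more likely the same identity.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    # Sort impostor scores in descending order.
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]
    # Number of impostor pairs that may be (wrongly) accepted at the target FMR.
    allowed = int(np.floor(target_fmr * impostor.size + 1e-9))
    # Using the (allowed + 1)-th largest impostor score as the threshold means
    # at most `allowed` impostor scores lie strictly above it.
    threshold = impostor[allowed] if allowed < impostor.size else -np.inf
    tmr = float(np.mean(genuine > threshold))
    return tmr, float(threshold)
```

With embeddings from a pretrained model, the scores would typically be cosine similarities between pairs of embeddings; that choice is also an assumption here.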

Abstract:
Recent advances in text‑to‑speech technologies have enabled realistic voice generation, fueling audio‑based deepfake attacks such as fraud and impersonation. While audio anti‑spoofing systems are critical for detecting such threats, prior work has predominantly focused on acoustic‑level perturbations, leaving the impact of linguistic variation largely unexplored. In this paper, we investigate the linguistic sensitivity of both open‑source and commercial anti‑spoofing detectors by introducing transcript‑level adversarial attacks. Our extensive evaluation reveals that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates surpass 60% on several open‑source detector‑voice pairs, and, notably, the accuracy of one commercial detector drops from 100% on synthetic audio to just 32%. Through a comprehensive feature attribution analysis, we identify that both linguistic complexity and model‑level audio embedding similarity contribute strongly to detector vulnerability. We further demonstrate the real‑world risk via a case study replicating the Brad Pitt audio deepfake scam, using transcript adversarial attacks to completely bypass commercial detectors. These results highlight the need to move beyond purely acoustic defenses and account for linguistic variation in the design of robust anti‑spoofing systems. All source code will be publicly available.

Abstract:
Recent advances in neural audio codec‑based speech generation (CoSG) models have produced remarkably realistic audio deepfakes. We refer to deepfake speech generated by CoSG systems as codec‑based deepfake, or CodecFake. Although existing anti‑spoofing research on CodecFake predominantly focuses on verifying the authenticity of audio samples, almost no attention has been given to tracing the CoSG system used to generate these deepfakes. In CodecFake generation, processes such as speech‑to‑unit encoding, discrete unit modeling, and unit‑to‑speech decoding are fundamentally based on neural audio codecs. Motivated by this, we introduce source tracing for CodecFake via neural audio codec taxonomy, which dissects neural audio codecs to trace CoSG. Our experimental results on the CodecFake+ dataset provide promising initial evidence for the feasibility of CodecFake source tracing while also highlighting several challenges that warrant further investigation.

Abstract:
Face Anti‑Spoofing (FAS) is essential for the security of facial recognition systems in diverse scenarios such as payment processing and surveillance. Current multimodal FAS methods often struggle with effective generalization, mainly due to modality‑specific biases and domain shifts. To address these challenges, we introduce the Multimodal Denoising and Alignment (MMDA) framework. By leveraging the zero‑shot generalization capability of CLIP, the MMDA framework effectively suppresses noise in multimodal data through denoising and alignment mechanisms, thereby significantly enhancing the generalization performance of cross‑modal alignment. The Modality‑Domain Joint Differential Attention (MD2A) module in MMDA concurrently mitigates the impacts of domain and modality noise by refining the attention mechanism based on extracted common noise features. Furthermore, the Representation Space Soft (RS2) Alignment strategy utilizes the pre‑trained CLIP model to align multi‑domain multimodal data into a generalized representation space in a flexible manner, preserving intricate representations and enhancing the model's adaptability to various unseen conditions. We also design a U‑shaped Dual Space Adaptation (U‑DSA) module to enhance the adaptability of representations while maintaining generalization performance. These improvements not only enhance the framework's generalization capabilities but also boost its ability to capture complex representations. Our experimental results on four benchmark datasets under different evaluation protocols demonstrate that the MMDA framework outperforms existing state‑of‑the‑art methods in terms of cross‑domain generalization and multimodal detection accuracy. The code will be released soon.

Abstract:
The demand for Presentation Attack Detection (PAD) to identify fraudulent ID documents in remote verification systems has significantly risen in recent years. This increase is driven by several factors, including the rise of remote work, online purchasing, migration, and advancements in synthetic images. Additionally, we have noticed a surge in the number of attacks aimed at the enrolment process. Training a PAD to detect fake ID documents is very challenging because of the limited number of ID documents available due to privacy concerns. This work proposes a new passport dataset generated from a hybrid method that combines synthetic data and open‑access information using the ICAO requirement to obtain realistic training and testing images.

Abstract:
Global navigation satellite systems (GNSS) are vulnerable to spoofing attacks, with adversarial signals manipulating the location or time information of receivers, potentially causing severe disruptions. The task of discerning spoofing signals from benign ones is a natural fit for machine learning, hence the recent interest in applying it to detection. While deep learning‑based methods are promising, they require extensive labeled datasets, consume significant computational resources, and raise privacy concerns due to the sensitive nature of position data. This paper therefore proposes a self‑supervised federated learning framework for GNSS spoofing detection. It consists of a cloud server and local mobile platforms. Each mobile platform employs a self‑supervised anomaly detector using long short‑term memory (LSTM) networks. Labels for training are generated locally through a spoofing‑deviation prediction algorithm, ensuring privacy. Local models are trained independently, and only their parameters are uploaded to the cloud server, which aggregates them into a global model using FedAvg. The updated global model is then distributed back to the mobile platforms and trained iteratively. The evaluation shows that our self‑supervised federated learning framework outperforms position‑based and deep learning‑based methods in detecting spoofing attacks while preserving data privacy.
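
The server‑side aggregation step mentioned above can be sketched as follows. This is a generic FedAvg average, not the paper's implementation: it assumes each mobile platform uploads a dictionary of NumPy arrays with identical keys and shapes, and that clients are weighted by their local sample counts (the abstract does not state the weighting). The name `fedavg` is hypothetical.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Aggregate per-client parameter dicts (name -> ndarray) into a global model.

    Assumption: each client's weight is proportional to its local sample count.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    names = client_params[0].keys()
    # Weighted sum of each named parameter across all clients.
    return {
        name: sum(w * params[name] for w, params in zip(weights, client_params))
        for name in names
    }
```

Only the parameters cross the network in this scheme; the raw position data and the locally generated labels stay on the device, which is the privacy argument the abstract makes.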

Abstract:
Face anti‑spoofing is a critical technology for ensuring the security of face recognition systems. However, its ability to generalize across diverse scenarios remains a significant challenge. In this paper, we attribute the limited generalization ability to two key factors: covariate shift, which arises from external data collection variations, and semantic shift, which results from substantial differences in emerging attack types. To address both challenges, we propose a novel approach for learning unknown spoof prompts, relying solely on real face images from a single source domain. Our method generates textual prompts for real faces and potential unknown spoof attacks by leveraging the general knowledge embedded in vision‑language models, thereby enhancing the model's ability to generalize to unseen target domains. Specifically, we introduce a diverse spoof prompt optimization framework to learn effective prompts. This framework constrains unknown spoof prompts within a relaxed prior knowledge space while maximizing their distance from real face images. Moreover, it enforces semantic independence among different spoof prompts to capture a broad range of spoof patterns. Experimental results on nine datasets demonstrate that the learned prompts effectively transfer the knowledge of vision‑language models, enabling state‑of‑the‑art generalization ability against diverse unknown attack types across unseen target domains without using any spoof face images.

Abstract:
3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection‑related text descriptions offer concise, universal information and are cost‑effective to obtain. However, the potential of vision‑language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge‑based prompt learning framework to explore the strong generalization capability of vision‑language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine‑grained, task‑specific explicit prompts that effectively harness the knowledge embedded in pre‑trained vision‑language models. Furthermore, considering that different input images may emphasize distinct knowledge graph elements, we introduce a visual‑specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we incorporate insights from causal graph theory into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category‑irrelevant local image patches using guidance from knowledge‑based text features, fostering the learning of generalized causal prompts that align with category‑relevant local patches. Experimental results demonstrate that the proposed method achieves state‑of‑the‑art intra‑ and cross‑scenario detection performance on benchmark datasets.

Abstract:
Saliency‑guided training, which directs model learning to important regions of images, has demonstrated generalization improvements across various biometric presentation attack detection (PAD) tasks. This paper presents its first application to fingerprint PAD. We conducted a 50‑participant study to create a dataset of 800 human‑annotated fingerprint perceptually‑important maps, explored alongside algorithmically‑generated "pseudosaliency," including minutiae‑based, image quality‑based, and autoencoder‑based saliency maps. Evaluating on the 2021 Fingerprint Liveness Detection Competition testing set, we explore various configurations within five distinct training scenarios to assess the impact of saliency‑guided training on accuracy and generalization. Our findings demonstrate the effectiveness of saliency‑guided training for fingerprint PAD in both limited and large data contexts, and we present a configuration capable of earning the first place on the LivDet‑2021 benchmark. Our results highlight saliency‑guided training's promise for increased model generalization capabilities, its effectiveness when data is limited, and its potential to scale to larger datasets in fingerprint PAD. All collected saliency data and trained models are released with the paper to support reproducible research.

Abstract:
Despite several algorithmic advances in the training of convolutional neural networks (CNNs) over the years, their generalization capabilities are still subpar across several pertinent domains, particularly within open‑set tasks often found in biometric and medical contexts. In contrast, humans have an uncanny ability to generalize to unknown visual stimuli. The efficient coding hypothesis posits that early visual structures (retina, Lateral Geniculate Nucleus, and primary visual cortex) transform inputs to reduce redundancy and maximize information efficiency. This mechanism of redundancy minimization in early vision was the inspiration for CNN regularization techniques that force convolutional kernels to be orthogonal. However, the existing works rely upon matrix projections, architectural modifications, or specific weight initializations, which frequently over‑constrain the network's learning process and excessively increase the computational load during loss function calculation. In this paper, we introduce a flexible and lightweight approach that regularizes a subset of first‑layer convolutional filters by making them pairwise‑orthogonal, which reduces the redundancy of the extracted features while avoiding excessive constraints on the network. We evaluate the proposed method on three open‑set visual tasks (anomaly detection in chest X‑ray images, synthetic face detection, and iris presentation attack detection) and observe an increase in the generalization capabilities of models trained with the proposed regularizer compared to state‑of‑the‑art kernel orthogonalization approaches. We offer source code along with the paper.
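
One way to picture a pairwise‑orthogonality regularizer on a subset of first‑layer filters is the penalty below. The abstract does not give the loss formula, so this is an illustrative assumption rather than the authors' method: each selected filter is flattened and L2‑normalized, and the penalty is the sum of squared cosine similarities over all distinct pairs. The function name and the `subset_idx` argument are hypothetical.

```python
import numpy as np

def pairwise_orthogonality_penalty(filters, subset_idx):
    """Penalty that is zero when the selected filters are mutually orthogonal.

    filters: array of shape (out_channels, in_channels, kH, kW).
    subset_idx: list of output-channel indices to regularize (assumed choice).
    """
    # Flatten each selected filter to a vector.
    w = filters[subset_idx].reshape(len(subset_idx), -1)
    # L2-normalize so the Gram matrix holds cosine similarities.
    w = w / (np.linalg.norm(w, axis=1, keepdims=True) + 1e-12)
    gram = w @ w.T
    # Remove the diagonal (self-similarity) and count each pair once.
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2) / 2.0)
```

In training, a term like this would be scaled by a coefficient and added to the task loss; filters outside `subset_idx` are left unconstrained, which matches the abstract's point about not over‑constraining the network.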

Abstract:
Presentation Attack Detection (PAD) systems are usually designed independently of the fingerprint verification system. While this can be acceptable for use cases where specific user templates are not predetermined, it represents a missed opportunity to enhance security in scenarios where integrating PAD with the fingerprint verification system could significantly leverage users' templates, which are the real target of a potential presentation attack. This does not mean that a PAD should be specifically designed for such users; that would imply the availability of many enrolled users' PAI and, consequently, an increase in complexity, time, and cost. On the contrary, we propose to equip a basic PAD, designed according to the state of the art, with an innovative add‑on module called the Closeness Binary Code (CC) module. The term "closeness" refers to a peculiar property of the bona fide‑related features: in a Euclidean feature space, genuine fingerprints tend to cluster in a specific pattern. First, samples from the same finger are close to each other, then samples from other fingers of the same user and finally, samples from fingers of other users. This property was statistically verified in our previous publication and is further confirmed in this paper. It is independent of the user population and the feature set class, which can be handcrafted or deep network‑based (embeddings). Therefore, the add‑on can be designed without the need for the targeted user samples; moreover, it exploits her/his samples' "closeness" property during the verification stage. Extensive experiments on benchmark datasets and state‑of‑the‑art PAD methods confirm the benefits of the proposed add‑on, which can be easily coupled with the main PAD module integrated into the fingerprint verification system.

Abstract:
Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand‑face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion‑based anti‑spoofing approaches. From a security perspective, there is a growing need for large‑scale, high‑quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand‑face interactions. Our approach simultaneously learns spatio‑temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision‑free contact. To facilitate this research, we present InterHF, a large‑scale hand‑face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region‑aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region‑aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large‑scale effort to systematically study human hand‑face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

Abstract:
Face recognition systems are vulnerable to physical attacks (e.g., printed photos) and digital threats (e.g., DeepFake), which are currently being studied as independent visual tasks, such as Face Anti‑Spoofing and Forgery Detection. The inherent differences among various attack types present significant challenges in identifying a common feature space, making it difficult to develop a unified framework for detecting data from both attack modalities simultaneously. Inspired by the efficacy of Mixture‑of‑Experts (MoE) in learning across diverse domains, we explore utilizing multiple experts to learn the distinct features of various attack types. However, the feature distributions of physical and digital attacks overlap and differ. This suggests that relying solely on distinct experts to learn the unique features of each attack type may overlook shared knowledge between them. To address these issues, we propose SUEDE, the Shared Unified Experts for Physical‑Digital Face Attack Detection Enhancement. SUEDE combines a shared expert (always activated) to capture common features for both attack types and multiple routed experts (selectively activated) for specific attack types. Further, we integrate CLIP as the base network to ensure the shared expert benefits from prior visual knowledge and align visual‑text representations in a unified space. Extensive results demonstrate SUEDE achieves superior performance compared to state‑of‑the‑art unified detection methods.
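
The combination of one always‑active shared expert with selectively activated routed experts can be sketched as a forward pass like the one below. This is a toy illustration under stated assumptions, not SUEDE itself: experts are plain linear maps, the router is a single linear layer followed by a softmax, and the selected gates are renormalized over the top‑k experts. All names and shapes are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def shared_plus_routed(x, shared_w, expert_ws, router_w, top_k=1):
    """Forward pass of a layer with a shared expert plus top-k routed experts.

    x: (batch, d); shared_w: (d, d); expert_ws: (n_experts, d, d);
    router_w: (d, n_experts). Linear experts are an assumption.
    """
    shared_out = x @ shared_w                     # shared expert: always active
    gates = softmax(x @ router_w)                 # (batch, n_experts)
    top = np.argsort(-gates, axis=1)[:, :top_k]   # selected expert indices per sample
    routed_out = np.zeros_like(shared_out)
    for b in range(x.shape[0]):
        sel = top[b]
        g = gates[b, sel] / gates[b, sel].sum()   # renormalize over selected experts
        for gi, e in zip(g, sel):
            routed_out[b] += gi * (x[b] @ expert_ws[e])
    return shared_out + routed_out
```

The shared path sees every sample, so it can model what physical and digital attacks have in common, while each routed expert only receives the samples the router sends to it.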

Abstract:
The challenge of Domain Generalization (DG) in Face Anti‑Spoofing (FAS) is the significant interference of domain‑specific signals on subtle spoofing clues. Recently, some CLIP‑based algorithms have been developed to alleviate this interference by adjusting the weights of visual classifiers. However, our analysis shows that this class‑wise prompt engineering suffers from two shortcomings for DG FAS: (1) The facial categories, such as real or spoof, have no semantics for the CLIP model, making it difficult to learn accurate category descriptions. (2) A single form of prompt cannot portray the various types of spoofing. In this work, instead of class‑wise prompts, we propose a novel Content‑aware Composite Prompt Engineering (CCPE) that generates instance‑wise composite prompts, including both fixed template and learnable prompts. Specifically, our CCPE constructs content‑aware prompts from two branches: (1) The inherent content prompt explicitly benefits from abundant transferred knowledge from the instruction‑based Large Language Model (LLM). (2) Learnable content prompts implicitly extract the most informative visual content via Q‑Former. Moreover, we design a Cross‑Modal Guidance Module (CGM) that dynamically adjusts unimodal features for fusion to achieve better generalized FAS. Finally, our CCPE has been validated for its effectiveness in multiple cross‑domain experiments and achieves state‑of‑the‑art (SOTA) results.

Abstract:
Although contactless fingerprints offer user comfort, they are more vulnerable to spoofing. Current anti‑spoofing solutions for contactless fingerprints rely on domain adaptation learning, which limits their generalization and scalability. To address these limitations, we introduce GRU‑AUNet, a domain adaptation approach that integrates a Swin Transformer‑based UNet architecture with GRU‑enhanced attention mechanisms, a Dynamic Filter Network in the bottleneck, and a combined Focal and Contrastive Loss function. Trained on both genuine and spoof fingerprint images, GRU‑AUNet demonstrates robust resilience against presentation attacks, achieving an average BPCER of 0.09% and APCER of 1.2% on the CLARKSON, COLFISPOOF, and IIITD datasets, outperforming state‑of‑the‑art domain adaptation methods.
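
For readers unfamiliar with the two error rates reported above, a minimal sketch of how APCER and BPCER are obtained from detector scores follows. It is a simplification and an assumption, not the paper's evaluation code: all attack samples are pooled into one rate (ISO/IEC 30107‑3 reports APCER per presentation attack instrument species), scores are assumed to be higher for attacks, and the threshold of 0.5 is arbitrary.

```python
import numpy as np

def apcer_bpcer(scores, labels, threshold=0.5):
    """Pooled APCER and BPCER at a fixed decision threshold.

    scores: higher = more likely an attack (assumed convention).
    labels: 1 = attack presentation, 0 = bona fide presentation.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pred_attack = scores >= threshold
    # APCER: fraction of attack presentations classified as bona fide.
    apcer = float(np.mean(~pred_attack[labels == 1]))
    # BPCER: fraction of bona fide presentations classified as attacks.
    bpcer = float(np.mean(pred_attack[labels == 0]))
    return apcer, bpcer
```

Both rates depend on the threshold, so a pair such as "BPCER 0.09%, APCER 1.2%" describes one operating point rather than the whole trade‑off curve.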

Abstract:
Developing a face anti‑spoofing model that meets the security requirements of clients worldwide is challenging due to the domain gap between training datasets and diverse end‑user test data. Moreover, for security and privacy reasons, it is undesirable for clients to share a large amount of their face data with service providers. In this work, we introduce a novel method in which the face anti‑spoofing model can be adapted by the client itself to a target domain at test time using only a small sample of data while keeping model parameters and training data inaccessible to the client. Specifically, we develop a prototype‑based base model and an optimal transport‑guided adaptor that enables adaptation in either a lightweight training or training‑free fashion, without updating the base model's parameters. Furthermore, we propose geodesic mixup, an optimal transport‑based synthesis method that generates augmented training data along the geodesic path between source prototypes and target data distribution. This allows training a lightweight classifier to effectively adapt to target‑specific characteristics while retaining essential knowledge learned from the source domain. In cross‑domain and cross‑attack settings, compared with recent methods, our method achieves average relative improvements of 19.17% in HTER and 8.58% in AUC, respectively.
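
The HTER metric and the notion of a relative improvement can be made concrete with the short sketch below. HTER is the standard mean of the false acceptance and false rejection rates; the relative‑improvement formula is an assumption about how such percentages are usually derived, since the abstract does not define it. The baseline numbers in the usage note are invented purely for illustration.

```python
def hter(far, frr):
    """Half Total Error Rate: mean of false acceptance and false rejection rates."""
    return 0.5 * (far + frr)

def relative_improvement(baseline, ours, lower_is_better=True):
    """Improvement over a baseline, expressed as a fraction of the baseline value.

    Assumed definition: for error metrics such as HTER a decrease is a gain,
    for metrics such as AUC an increase is a gain.
    """
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return delta / baseline
```

For example, under this definition a hypothetical drop in HTER from 10.0 to 8.0 is a 20% relative improvement, and a hypothetical rise in AUC from 0.90 to 0.99 is a 10% relative improvement.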

Abstract:
Face anti‑spoofing (FAS) heavily relies on identifying live/spoof discriminative features to counter face presentation attacks. Recently, we proposed LDCformer to successfully incorporate the Learnable Descriptive Convolution (LDC) into ViT, to model long‑range dependency of locally descriptive features for FAS. In this paper, we propose three novel training strategies to effectively enhance the training of LDCformer to largely boost its feature characterization capability. The first strategy, dual‑attention supervision, is developed to learn fine‑grained liveness features guided by regional live/spoof attentions. The second strategy, self‑challenging supervision, is designed to enhance the discriminability of the features by generating challenging training data. In addition, we propose a third training strategy, transitional triplet mining strategy, through narrowing the cross‑domain gap while maintaining the transitional relationship between live and spoof features, to enlarge the domain‑generalization capability of LDCformer. Extensive experiments show that LDCformer under joint supervision of the three novel training strategies outperforms previous methods.

Abstract:
Face anti‑spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two‑class FAS methods risk overfitting to training attacks to achieve better performance, one‑class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet), a one‑class FAS technique that enhances generalizability by augmenting face images via disentangled features. The UFDANet employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out‑of‑distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, UFDANet incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed UFDANet outperforms previous one‑class FAS methods and achieves comparable performance to state‑of‑the‑art two‑class FAS methods.

Abstract:
Face anti‑spoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision‑language pretrained (VLP) models, recent two‑class FAS techniques have leveraged the advantages of using VLP guidance, while this potential remains unexplored in one‑class FAS methods. The one‑class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one‑class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (e.g., facial content), causing performance degradation when tested with a new application domain. To address this issue, we propose a novel framework called Spoof‑aware one‑class face anti‑spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof‑attack‑related objects (e.g., paper or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language‑guided spoof cue map estimation to enhance one‑class FAS models by simulating whether the underlying faces are covered by attack‑related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt‑driven liveness feature disentanglement to alleviate live/spoof‑irrelevant domain variations by disentangling live/spoof‑relevant and domain‑dependent information. Finally, we design an effective augmentation strategy by fusing latent features from live images and spoof prompts to generate spoof‑like image features and thus diversify latent spoof features to facilitate the learning of one‑class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one‑class FAS methods.

Abstract:
Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision‑based models, MobileNetV2, ResNet50, and Vision Transformer (ViT), for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training (140,002), testing (10,984), and validation (39,574) sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models' effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT's 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2 and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT's 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2's balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security‑sensitive contexts and suggests MobileNetV2 as a practical solution for real‑world deployment.

Abstract:
Finger photo Presentation Attack Detection (PAD) can significantly strengthen smartphone device security. However, these algorithms are trained to detect certain types of attacks. Furthermore, they are designed to operate on images acquired by specific capture devices, leading to poor generalization and a lack of robustness in handling the evolving nature of mobile hardware. The proposed investigation is the first to systematically analyze the performance degradation of existing deep learning PAD systems, both convolutional and transformer‑based, in cross‑capture device settings. In this paper, we introduce the ColFigPhotoAttnNet architecture, which is based on window attention on color channels followed by a nested residual network as the predictor, to achieve reliable PAD. Extensive experiments using various capture devices, including iPhone13 Pro, GooglePixel 3, Nokia C5, and OnePlusOne, were carried out to evaluate the performance of the proposed and existing methods on three publicly available databases. The findings underscore the effectiveness of our approach.

Abstract:
In environmental protection, tree monitoring plays an essential role in maintaining and improving ecosystem health. However, precise monitoring is challenging because existing datasets fail to capture continuous fine‑grained changes in trees due to low‑resolution images and high acquisition costs. In this paper, we introduce UAVTC, a large‑scale, long‑term, high‑resolution dataset collected using UAVs equipped with cameras, specifically designed to detect individual Tree Changes (TCs). UAVTC includes rich annotations and statistics based on biological knowledge, offering a fine‑grained view for tree monitoring. To address environmental influences and effectively model the hierarchical diversity of physiological TCs, we propose a novel Hyperbolic Siamese Network (HSN) for TC detection, enabling compact and hierarchical representations of dynamic tree changes. Extensive experiments show that HSN can effectively capture complex hierarchical changes and provide a robust solution for fine‑grained TC detection. In addition, HSN generalizes well to the cross‑domain face anti‑spoofing task, highlighting its broader significance in AI. We believe our work, combining ecological insights and interdisciplinary expertise, will benefit the community by offering a new benchmark and innovative AI technologies.

Abstract:
3D vision is of paramount importance for numerous applications ranging from machine intelligence to precision metrology. Despite much recent progress, the majority of 3D imaging hardware remains bulky and complicated and provides much lower image resolution than its 2D counterparts. Moreover, there are many well‑known scenarios in which existing 3D imaging solutions frequently fail. Here, we introduce an extended monocular 3D imaging (EM3D) framework that fully exploits the vectorial wave nature of light. Via the multi‑stage fusion of diffraction‑ and polarization‑based depth cues, using a compact monocular camera equipped with a diffractive‑refractive hybrid lens, we experimentally demonstrate the snapshot acquisition of a million‑pixel and accurate 3D point cloud for extended scenes that are traditionally challenging, including those with low texture, being highly reflective, or nearly transparent, without a data prior. Furthermore, we discover that the combination of depth and polarization information can unlock unique new opportunities in material identification, which may further expand machine intelligence for applications like target recognition and face anti‑spoofing. The straightforward yet powerful architecture thus opens up a new path for a higher‑dimensional machine vision in a minimal form factor, facilitating the deployment of monocular cameras for applications in much more diverse scenarios.

Abstract:
Global Navigation Satellite Systems enable precise localization and timing even for highly mobile devices, but legacy implementations provide only limited support for the new generation of security‑enhanced signals. Inertial Measurement Units have proved successful in augmenting the accuracy and robustness of the GNSS‑provided navigation solution, but effective navigation based on inertial techniques in denied contexts requires high‑end sensors. However, commercially available mobile devices usually embed a much lower‑grade inertial system. To counteract an attacker transmitting all the adversarial signals from a single antenna, we exploit carrier phase‑based observations coupled with a low‑end inertial sensor to identify spoofing and meaconing. By short‑time integration with an inertial platform, which tracks the displacement of the GNSS antenna, the high‑frequency movement at the receiver is correlated with the variation in the carrier phase. In this way, we identify legitimate transmitters, based on their geometrical diversity with respect to the antenna system movement. We introduce a platform designed to effectively compare different tiers of commercial INS platforms with a GNSS receiver. By characterizing different inertial sensors, we show that simple MEMS INS perform as well as high‑end industrial‑grade sensors. Sensors traditionally considered unsuited for navigation purposes offer great performance at the short integration times used to evaluate the carrier phase information consistency against the high‑frequency movement. Results from laboratory evaluation and through field tests at Jammertest 2024 show that the detector is up to 90% accurate in correctly identifying spoofing (or the lack of it), without any modification to the receiver structure, and with mass‑production grade INS typical for mobile phones.

Abstract:
This study highlights the potential of ChatGPT (specifically GPT‑4o) as a competitive alternative for Face Presentation Attack Detection (PAD), outperforming several PAD models, including commercial solutions, in specific scenarios. Our results show that GPT‑4o demonstrates high consistency, particularly in few‑shot in‑context learning, where its performance improves as more examples are provided (reference data). We also observe that detailed prompts enable the model to provide scores reliably, a behavior not observed with concise prompts. Additionally, explanation‑seeking prompts slightly enhance the model's performance by improving its interpretability. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few‑shot scenarios, despite not being explicitly instructed to classify attack types. Despite these strengths, GPT‑4o faces challenges in zero‑shot tasks, where its performance is limited compared to specialized PAD systems. Experiments were conducted on a subset of the SOTERIA dataset, ensuring compliance with data privacy regulations by using only data from consenting individuals. These findings underscore GPT‑4o's promise in PAD applications, laying the groundwork for future research to address broader data privacy concerns and improve cross‑dataset generalization. Code available here: https://gitlab.idiap.ch/bob/bob.paper.wacv2025_chatgpt_face_pad

Abstract:
With the rapid advancement of neural audio codecs, codec‑based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large‑scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re‑synthesis using 31 publicly available open‑source codec models, while the evaluation set includes web‑sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re‑synthesized speech (CoRS) as training data for large‑scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re‑synthesis model incorporates disentanglement auxiliary objectives or a frequency‑domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine‑grained exploration to develop better anti‑spoofing models against CodecFake.

Abstract:
Foundation models are becoming increasingly popular due to their strong generalization capabilities resulting from being trained on huge datasets. These generalization capabilities are attractive in areas such as NIR Iris Presentation Attack Detection (PAD), in which databases are limited in the number of subjects and diversity of attack instruments, and there is no correspondence between the bona fide and attack images because, most of the time, they do not belong to the same subjects. This work explores an iris PAD approach based on two foundation models, DinoV2 and VisualOpenClip. The results show that fine‑tuning with a small neural network as the prediction head surpasses the state‑of‑the‑art performance of deep learning approaches. However, systems trained from scratch still reach better results when bona fide and attack images are available.

Abstract:
Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut‑and‑paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re‑implementing Voicebox training and dataset creation. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut‑and‑paste methods. Despite human difficulty, experimental results demonstrate that self‑supervised‑based detectors can achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.

Abstract:
The increasing reliance on Global Navigation Satellite Systems (GNSS), particularly the Global Positioning System (GPS), underscores the urgent need to safeguard these technologies against malicious threats such as spoofing and jamming. As the backbone for positioning, navigation, and timing (PNT) across various applications, including transportation, telecommunications, and emergency services, GNSS is vulnerable to deliberate interference that poses significant risks. Spoofing attacks, which involve transmitting counterfeit GNSS signals to mislead receivers into calculating incorrect positions, can result in serious consequences, from navigational errors in civilian aviation to security breaches in military operations. Furthermore, the lack of inherent security measures within GNSS systems makes them attractive targets for adversaries. While GNSS/GPS jamming and spoofing systems consist of numerous components, the ability to distinguish authentic signals from malicious ones is essential for maintaining system integrity. Recent advancements in machine learning and deep learning provide promising avenues for enhancing detection and mitigation strategies against these threats. This paper addresses both spoofing and jamming by tackling real‑world challenges through machine learning, deep learning, and computer vision techniques. Through extensive experiments on two real‑world datasets related to spoofing and jamming detection using advanced algorithms, we achieved state‑of‑the‑art results. In the GNSS/GPS jamming detection task, we attained approximately 99% accuracy, improving performance by around 5% compared to previous studies. Additionally, we addressed a challenging task related to spoofing detection, yielding results that underscore the potential of machine learning and deep learning in this domain.

Abstract:
Face Anti‑Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out‑of‑domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti‑Spoofing (I‑FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof‑aware Captioning and Filtering (SCF) strategy to generate high‑quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L‑LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi‑level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross‑domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state‑of‑the‑art methods.

Abstract:
This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker‑related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker‑related attributes as target labels. These attributes are categorized into two groups: metadata‑based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.

Abstract:
With the rapidly growing use of face recognition in people's daily lives, face anti‑spoofing is becoming increasingly important for preventing malicious attacks. Recent face anti‑spoofing models can reach a high classification accuracy on multiple datasets, but these models can only tell people "this face is fake" while lacking the explanation to answer "why it is fake". Such a system undermines trustworthiness and causes user confusion, as it denies their requests without providing any explanations. In this paper, we incorporate XAI into face anti‑spoofing and propose a new problem termed X‑FAS (eXplainable Face Anti‑Spoofing), empowering face anti‑spoofing models to provide an explanation. We propose SPTD (SPoof Trace Discovery), an X‑FAS method which can discover spoof concepts and provide reliable explanations on the basis of discovered concepts. To evaluate the quality of X‑FAS methods, we propose an X‑FAS benchmark with spoof traces annotated by experts. We analyze SPTD explanations on a face anti‑spoofing dataset and compare SPTD quantitatively and qualitatively with previous XAI methods on the proposed X‑FAS benchmark. Experimental results demonstrate SPTD's ability to generate reliable explanations.

Abstract:
Detecting spoofing attacks to Low‑Earth‑Orbit (LEO) satellite systems is a cornerstone to assessing the authenticity of the received information and guaranteeing robust service delivery in several application domains. The solutions available today for spoofing detection either rely on additional communication systems, receivers, and antennas, or require mobile deployments. Detection systems working at the Physical (PHY) layer of the satellite communication link also require time‑consuming and energy‑hungry training processes on all satellites of the constellation, and rely on the availability of spoofed data, which are often challenging to collect. Moreover, none of such contributions investigate the feasibility of aerial spoofing attacks launched via drones operating at various altitudes. In this paper, we propose a new spoofing detection technique for LEO satellite constellation systems, applying anomaly detection on the received PHY signal via autoencoders. We validate our solution through an extensive measurement campaign involving the deployment of an actual spoofer (Software‑Defined Radio) installed on a drone and injecting rogue IRIDIUM messages while flying at different altitudes with various movement patterns. Our results demonstrate that the proposed technique can reliably detect LEO spoofing attacks launched at different altitudes, while state‑of‑the‑art competing approaches simply fail. We also release the collected data as open source, fostering further research on satellite security.
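
The detection idea described above, learning only from genuine signals and flagging inputs that are reconstructed poorly, can be sketched as follows. This is a minimal illustration and not the paper's pipeline: a PCA‑based linear autoencoder stands in for the neural autoencoder, the features are synthetic rather than PHY‑layer IRIDIUM samples, and the class name `LinearAutoencoder` and the 99th‑percentile threshold are assumptions made for the example.

```python
import numpy as np


class LinearAutoencoder:
    """PCA-based linear autoencoder, a stand-in for a neural autoencoder."""

    def fit(self, genuine, n_components):
        # Encoder/decoder weights are the top principal directions of genuine data.
        self.mean = genuine.mean(axis=0)
        _, _, vt = np.linalg.svd(genuine - self.mean, full_matrices=False)
        self.components = vt[:n_components]
        # Assumed decision rule: 99th percentile of the genuine reconstruction errors.
        self.threshold = np.percentile(self.reconstruction_error(genuine), 99)
        return self

    def reconstruction_error(self, samples):
        # Project onto the learned subspace, map back, and measure the residual.
        centered = samples - self.mean
        reconstructed = centered @ self.components.T @ self.components
        return np.mean((centered - reconstructed) ** 2, axis=1)

    def is_anomalous(self, samples):
        return self.reconstruction_error(samples) > self.threshold


# Synthetic "genuine" features lying close to a line in 3-D (illustrative only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
genuine = latent @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(500, 3))
model = LinearAutoencoder().fit(genuine, n_components=1)

off_manifold = np.array([[3.0, -3.0, 0.0]])  # does not follow the genuine structure
print(model.is_anomalous(off_manifold))
```

Because the threshold is derived from genuine data alone, no spoofed examples are needed at training time, which matches the motivation stated in the abstract.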

Abstract:
This work asks: with abundant, unlabeled real faces, how to learn a robust and transferable facial representation that boosts various face security tasks with respect to generalization performance? We make the first attempt and propose a self‑supervised pretraining framework to learn fundamental representations of real face images, FSFM, that leverages the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR‑P masking, which explicitly forces the model to capture meaningful intra‑region consistency and challenging inter‑region coherency. Furthermore, we devise the ID network that naturally couples with MIM to establish underlying local‑to‑global correspondence via tailored self‑distillation. These three learning objectives, namely 3C, empower encoding both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross‑dataset deepfake detection, cross‑domain face anti‑spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining, visual and facial self‑supervised learning arts, and even outperforms task‑specialized SOTA methods.

Abstract:
Voice authentication on IoT‑enabled smart devices has gained prominence in recent years due to increasing concerns over user privacy and security. The current authentication systems are vulnerable to different voice‑spoofing attacks (e.g., replay, voice cloning, and audio deepfakes) that mimic legitimate voices to deceive authentication systems and enable fraudulent activities (e.g., impersonation, unauthorized access, financial fraud, etc.). Existing solutions are often designed to tackle a single type of attack, leading to compromised performance against unseen attacks. On the other hand, existing unified voice anti‑spoofing solutions, not designed specifically for IoT, possess complex architectures and thus cannot be deployed on IoT‑enabled smart devices. Additionally, most of these unified solutions exhibit significant performance issues, including higher equal error rates or lower accuracy for specific attacks. To overcome these issues, we present the parallel stacked aggregation network (PSA‑Net), a lightweight framework designed as an anti‑spoofing defense system for voice‑controlled smart IoT devices. The PSA‑Net processes raw audio directly and eliminates the need for dataset‑dependent handcrafted features or pre‑computed spectrograms. Furthermore, PSA‑Net employs a split‑transform‑aggregate approach, which involves the segmentation of utterances, the extraction of intrinsic differentiable embeddings through convolutions, and the aggregation of them to distinguish legitimate from spoofed audio. In contrast to existing deep ResNet‑oriented solutions, we incorporate cardinality as an additional dimension in our network, which enhances the PSA‑Net's ability to generalize across diverse attacks. The results show that the PSA‑Net achieves more consistent performance across different attacks than current anti‑spoofing solutions.

Abstract:
We present three biometric datasets (iCarB‑Face, iCarB‑Fingerprint, iCarB‑Voice) containing face videos, fingerprint images, and voice samples, collected inside a car from 200 consenting volunteers. The data was acquired using a near‑infrared camera, two fingerprint scanners, and two microphones, while the volunteers were seated in the driver's seat of the car. The data collection took place while the car was parked both indoors and outdoors, and different "noises" were added to simulate non‑ideal biometric data capture that may be encountered in real‑life driver recognition. Although the datasets are specifically tailored to in‑vehicle biometric recognition, their utility is not limited to the automotive environment. The iCarB datasets, which are available to the research community, can be used to: (i) evaluate and benchmark face, fingerprint, and voice recognition systems (we provide several evaluation protocols); (ii) create multimodal pseudo‑identities, to train/test multimodal fusion algorithms; (iii) create Presentation Attacks from the biometric data, to evaluate Presentation Attack Detection algorithms; (iv) investigate demographic and environmental biases in biometric systems, using the provided metadata. To the best of our knowledge, ours are the largest and most diverse publicly available in‑vehicle biometric datasets. Most other datasets contain only one biometric modality (usually face), while our datasets consist of three modalities, all acquired in the same automotive environment. Moreover, iCarB‑Fingerprint seems to be the first publicly available in‑vehicle fingerprint dataset. Finally, the iCarB datasets boast a rare level of demographic diversity among the 200 data subjects, including a 50/50 gender split, skin colours across the whole Fitzpatrick‑scale spectrum, and a wide age range (18‑60+). So, these datasets will be valuable for advancing biometrics research.

Abstract:
Current Face Anti‑spoofing (FAS) models tend to make overly confident predictions even when encountering unfamiliar scenarios or unknown presentation attacks, which leads to serious potential risks. To solve this problem, we propose a Confidence Aware Face Anti‑spoofing (CA‑FAS) model, which is aware of its capability boundary, thus achieving reliable liveness detection within this boundary. To enable the CA‑FAS to "know what it doesn't know", we propose to estimate its confidence during the prediction of each sample. Specifically, we build Gaussian distributions for both the live faces and the known attacks. The prediction confidence for each sample is subsequently assessed using the Mahalanobis distance between the sample and the Gaussians for the "known data". We further introduce the Mahalanobis distance‑based triplet mining to optimize the parameters of both the model and the constructed Gaussians as a whole. Extensive experiments show that the proposed CA‑FAS can effectively recognize samples with low prediction confidence and thus achieve much more reliable performance than other FAS models by filtering out samples that are beyond its reliable range.
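
The confidence estimate described above can be illustrated with a small sketch: fit one Gaussian per known class and use the Mahalanobis distance to the nearest Gaussian as a measure of how familiar a sample is. This is only an assumption‑laden illustration, not the CA‑FAS model: the 2‑D features are synthetic, the triplet mining step is omitted, and the function names and the `max_distance` rejection threshold are hypothetical.

```python
import numpy as np


def fit_gaussian(features):
    """Return the mean and inverse covariance of a (n_samples, dim) feature array."""
    mean = features.mean(axis=0)
    # A small ridge term keeps the covariance invertible when samples are scarce.
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance between one sample and a Gaussian."""
    diff = x - mean
    return float(np.sqrt(diff @ inv_cov @ diff))


def predict_with_confidence(x, gaussians, max_distance):
    """Pick the nearest class Gaussian; mark the prediction unreliable if it is too far."""
    distances = {label: mahalanobis(x, m, ic) for label, (m, ic) in gaussians.items()}
    label = min(distances, key=distances.get)
    return label, distances[label], distances[label] <= max_distance


# Synthetic features for the "known data" (illustrative only).
rng = np.random.default_rng(0)
live = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
attack = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(200, 2))
gaussians = {"live": fit_gaussian(live), "attack": fit_gaussian(attack)}

print(predict_with_confidence(np.array([0.1, -0.1]), gaussians, max_distance=3.0))
print(predict_with_confidence(np.array([2.0, 2.0]), gaussians, max_distance=3.0))
```

A sample that falls between the two clusters is far from both Gaussians, so it is reported as low confidence rather than being forced into one class.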

Abstract:
Face recognition technologies are increasingly used in various applications, yet they are vulnerable to face spoofing attacks. These spoofing attacks often involve unique 3D structures, such as printed papers or mobile device screens. Although stereo‑depth cameras can detect such attacks effectively, their high‑cost limits their widespread adoption. Conversely, two‑sensor systems without extrinsic calibration offer a cost‑effective alternative but are unable to calculate depth using stereo techniques. In this work, we propose a method to overcome this challenge by leveraging facial attributes to derive disparity information and estimate relative depth for anti‑spoofing purposes, using non‑calibrated systems. We introduce a multi‑modal anti‑spoofing model, coined Disparity Model, that incorporates created disparity maps as a third modality alongside the two original sensor modalities. We demonstrate the effectiveness of the Disparity Model in countering various spoof attacks using a comprehensive dataset collected from the Intel RealSense ID Solution F455. Our method outperformed existing methods in the literature, achieving an Equal Error Rate (EER) of 1.71% and a False Negative Rate (FNR) of 2.77% at a False Positive Rate (FPR) of 1%. These errors are lower by 2.45% and 7.94% than the errors of the best comparison method, respectively. Additionally, we introduce a model ensemble that addresses 3D spoof attacks as well, achieving an EER of 2.04% and an FNR of 3.83% at an FPR of 1%. Overall, our work provides a state‑of‑the‑art solution for the challenging task of anti‑spoofing in non‑calibrated systems that lack depth information.
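
For readers unfamiliar with the metrics quoted above, the following sketch shows one common way to compute an Equal Error Rate and an FNR at a fixed FPR from classifier scores. It rests on assumptions that the abstract does not state: label 1 marks the positive class, higher scores mean "more positive", and the EER is approximated by scanning the observed scores as thresholds. The function names and the toy scores are hypothetical and unrelated to the reported results.

```python
import numpy as np


def error_rates(scores, labels, threshold):
    """FPR and FNR when samples with score >= threshold are called positive."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    predicted = scores >= threshold
    fpr = np.mean(predicted[labels == 0])   # negatives wrongly accepted
    fnr = np.mean(~predicted[labels == 1])  # positives wrongly rejected
    return float(fpr), float(fnr)


def equal_error_rate(scores, labels):
    """Approximate EER: the operating point where FPR and FNR are closest."""
    best = min((error_rates(scores, labels, t) for t in np.unique(scores)),
               key=lambda rates: abs(rates[0] - rates[1]))
    return (best[0] + best[1]) / 2


def fnr_at_fpr(scores, labels, target_fpr):
    """Lowest FNR among thresholds whose FPR does not exceed target_fpr."""
    # The infinite threshold rejects everything, so a valid candidate always exists.
    candidates = list(np.unique(scores)) + [np.inf]
    rates = [error_rates(scores, labels, t) for t in candidates]
    return min(fnr for fpr, fnr in rates if fpr <= target_fpr)


toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(equal_error_rate(toy_scores, toy_labels))
print(fnr_at_fpr(toy_scores, toy_labels, target_fpr=0.0))
```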

Abstract:
This paper presents a human verification scheme in two independent stages to overcome the vulnerabilities of attacks and to enhance security. At the first stage, a hand image‑based CAPTCHA (HandCAPTCHA) is tested to avert automated bot‑attacks on the subsequent biometric stage. In the next stage, finger biometric verification of a legitimate user is performed with presentation attack detection (PAD) using the real hand images of the person who has passed a random HandCAPTCHA challenge. The electronic screen‑based PAD is tested using image quality metrics. After this spoofing detection, geometric features are extracted from the four fingers (excluding the thumb) of real users. A modified forward‑backward (M‑FoBa) algorithm is devised to select relevant features for biometric authentication. The experiments are performed on the Bogazici University (BU) and the IIT‑Delhi (IITD) hand databases using the k‑nearest neighbor and random forest classifiers. The average accuracy of the correct HandCAPTCHA solution is 98.5%, and the false accept rate of a bot is 1.23%. The PAD is tested on 255 subjects of BU, and the best average error is 0%. The finger biometric identification accuracy of 98% and an equal error rate (EER) of 6.5% have been achieved for 500 subjects of the BU. For 200 subjects of the IITD, 99.5% identification accuracy, and 5.18% EER are obtained.

Abstract:
Smartphone‑based contactless fingerphoto authentication has become a reliable alternative to traditional contact‑based fingerprint biometric systems owing to rapid advances in smartphone camera technology. Despite its convenience, fingerprint authentication through fingerphotos is more vulnerable to presentation attacks, which has motivated recent research efforts towards developing fingerphoto Presentation Attack Detection (PAD) techniques. However, prior PAD approaches utilized supervised learning methods that require labeled training data for both bona fide and attack samples. This can suffer from two key issues, namely (i) generalization: the detection of novel presentation attack instruments (PAIs) unseen in the training data, and (ii) scalability: the collection of a large dataset of attack samples using different PAIs. To address these challenges, we propose a novel unsupervised approach based on a state‑of‑the‑art deep‑learning‑based diffusion model, the Denoising Diffusion Probabilistic Model (DDPM), which is trained solely on bona fide samples. The proposed approach detects Presentation Attacks (PA) by calculating the reconstruction similarity between the input and output pairs of the DDPM. We present extensive experiments across three PAI datasets to test the accuracy and generalization capability of our approach. The results show that the proposed DDPM‑based PAD method achieves significantly better detection error rates on several PAI classes compared to other baseline unsupervised approaches.

Abstract:
Self‑supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL‑based SV systems often struggle to capture local temporal dependencies and generalize across different tasks. In this paper, we propose context‑aware multi‑head factorized attentive pooling (CA‑MHFA), a lightweight framework that incorporates contextual information from surrounding frames. CA‑MHFA leverages grouped, learnable queries to effectively model contextual dependencies while maintaining efficiency by sharing keys and values across groups. Experimental results on the VoxCeleb dataset show that CA‑MHFA achieves EERs of 0.42%, 0.48%, and 0.96% on Vox1‑O, Vox1‑E, and Vox1‑H, respectively, outperforming complex models like WavLM‑TDNN with fewer parameters and faster convergence. Additionally, CA‑MHFA demonstrates strong generalization across multiple SSL models and tasks, including emotion recognition and anti‑spoofing, highlighting its robustness and versatility.

Abstract:
The ASVspoof 2021 benchmark, a widely‑used evaluation framework for anti‑spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state‑of‑the‑art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF. However, benchmark accuracy is no guarantee of robustness in real‑world scenarios. This paper investigates the effectiveness of utilizing room impulse responses (RIRs) to enhance fake speech and increase its likelihood of evading fake speech detection systems. Our findings reveal that this simple approach significantly improves the evasion rate, doubling the SOTA system's EER. To counter this type of attack, we augment the training data with a large‑scale synthetic/simulated RIR dataset. The results demonstrate significant improvement on both reverberated fake speech and original samples, reducing DF task EER to 2.13%.
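
Applying a room impulse response to an utterance amounts to a convolution, which the sketch below illustrates. It is a toy example under stated assumptions, not the paper's augmentation pipeline: the RIR is exponentially decaying noise rather than a measured or simulated room response, the input is a sine tone instead of speech, and the function names `reverberate` and `synthetic_rir` are hypothetical.

```python
import numpy as np


def reverberate(speech, rir):
    """Convolve a mono waveform with an RIR, keep the original length and peak level."""
    wet = np.convolve(speech, rir, mode="full")[: len(speech)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(speech)) / peak)
    return wet


def synthetic_rir(sample_rate, rt60, rng):
    """Toy RIR: a direct path followed by exponentially decaying noise (assumption)."""
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    decay = np.exp(-6.9 * t / rt60)  # amplitude falls by roughly 60 dB at t = rt60
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0                     # direct path
    return rir


rng = np.random.default_rng(0)
sample_rate = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(1600) / sample_rate)  # stand-in for speech
reverberant = reverberate(tone, synthetic_rir(sample_rate, rt60=0.05, rng=rng))
print(reverberant.shape)
```

The same function can serve both roles mentioned in the abstract: reverberating spoofed audio to probe a detector, and augmenting training data so the detector sees reverberant examples.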

Abstract:
We propose a novel approach for spoofed speech characterization through explainable probabilistic attribute embeddings. In contrast to high‑dimensional raw embeddings extracted from a spoofing countermeasure (CM) whose dimensions are not easy to interpret, the probabilistic attributes are designed to gauge the presence or absence of sub‑components that make up a specific spoofing attack. These attributes are then applied to two downstream tasks: spoofing detection and attack attribution. To extend interpretability to the back‑end as well, we adopt a decision tree classifier. Our experiments on the ASVspoof2019 dataset with spoof CM embeddings extracted from three models (AASIST, Rawboost‑AASIST, SSL‑AASIST) suggest that the performance of the attribute embeddings is on par with the original raw spoof CM embeddings for both tasks. The best performance achieved with the proposed approach for spoofing detection and attack attribution, in terms of accuracy, is 99.7% and 99.2%, respectively, compared to 99.7% and 94.7% using the raw CM embeddings. To analyze the relative contribution of each attribute, we estimate their Shapley values. Attributes related to acoustic feature prediction, waveform generation (vocoder), and speaker modeling are found important for spoofing detection, while duration modeling, vocoder, and input type play a role in spoofing attack attribution.

Abstract:
Mainstream zero‑shot TTS production systems like Voicebox and Seed‑TTS achieve human‑parity speech by leveraging flow‑matching and diffusion models, respectively. Unfortunately, human‑level audio synthesis leads to identity misuse and information security issues. Currently, many anti‑spoofing models have been developed against deepfake audio. However, the efficacy of current state‑of‑the‑art anti‑spoofing models in countering audio synthesized by diffusion and flow‑matching based TTS systems remains unknown. In this paper, we propose the Diffusion and Flow‑matching based Audio Deepfake (DFADD) dataset. The DFADD dataset collects deepfake audio generated by advanced diffusion and flow‑matching TTS models. Additionally, we reveal that current anti‑spoofing models lack sufficient robustness against highly human‑like audio generated by diffusion and flow‑matching TTS systems. The proposed DFADD dataset addresses this gap and provides a valuable resource for developing more resilient anti‑spoofing models.

Abstract:
Language mismatch affects speech anti‑spoofing systems, yet investigation and quantification of its effects remain limited. Existing anti‑spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language‑independent models. We initiate this work by evaluating top‑performing speech anti‑spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach, Accent‑based data expansion via TTS (ACCENT), which introduces diverse linguistic knowledge to monolingual‑trained models, improving their cross‑lingual capabilities. We conduct experiments on a large‑scale dataset consisting of over 3 million samples, including 1.8 million training samples and nearly 1.2 million testing samples across 12 languages. The language mismatch effects are preliminarily quantified and reduced by over 15% by applying the proposed ACCENT. This easily implementable method shows promise for multilingual and low‑resource language scenarios.

Abstract:
This paper proposes a Few‑shot Learning (FSL) approach for detecting Presentation Attacks on ID Cards deployed in a remote verification system and its extension to new countries. Our research analyses the performance of Prototypical Networks across documents from Spain and Chile as a baseline and measures how well the approach generalises to ID cards from new countries such as Argentina and Costa Rica, specifically targeting the challenge of screen display presentation attacks. By leveraging convolutional architectures and meta‑learning principles embodied in Prototypical Networks, we have crafted a model that demonstrates high efficacy with few‑shot examples. This research reveals that competitive performance can be achieved with as few as five unique identities and with under 100 images per new country added. This opens a new insight for novel generalised Presentation Attack Detection on ID cards to unknown attacks.
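
The core classification rule of a Prototypical Network can be sketched in a few lines: each class is represented by the mean of its few support embeddings, and a query is assigned to the nearest prototype. The example below assumes the embeddings have already been produced by some convolutional encoder, which is omitted here; the 2‑D vectors, the class names, and the function names are hypothetical and only illustrate the decision rule.

```python
import numpy as np


def class_prototypes(support_embeddings, support_labels):
    """Mean embedding per class, computed from the few labelled support samples."""
    labels = np.asarray(support_labels)
    return {c: support_embeddings[labels == c].mean(axis=0)
            for c in sorted(set(support_labels))}


def classify(query, prototypes):
    """Assign the query to the class whose prototype is nearest (squared Euclidean)."""
    return min(prototypes, key=lambda c: float(np.sum((query - prototypes[c]) ** 2)))


# Three support embeddings per class stand in for encoder outputs (illustrative only).
support = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = ["bona_fide"] * 3 + ["screen_attack"] * 3
prototypes = class_prototypes(support, labels)

print(classify(np.array([0.2, 0.3]), prototypes))
print(classify(np.array([5.5, 5.0]), prototypes))
```

Adding a new country in this setting only requires computing prototypes from its few support images, which is why so few identities can suffice.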

Abstract:
In real‑world applications, it is challenging to build a speaker verification system that is simultaneously robust against common threats, including spoofing attacks, channel mismatch, and domain mismatch. Traditional automatic speaker verification (ASV) systems often tackle these issues separately, leading to suboptimal performance when faced with simultaneous challenges. In this paper, we propose an integrated framework that incorporates pair‑wise learning and spoofing attack simulation into the meta‑learning paradigm to enhance robustness against these multifaceted threats. This novel approach employs an asymmetric dual‑path model and a multi‑task learning strategy to handle ASV, anti‑spoofing, and spoofing‑aware ASV tasks concurrently. A new testing dataset, CNComplex, is introduced to evaluate system performance under these combined threats. Experimental results demonstrate that our integrated model significantly improves performance over traditional ASV systems across various scenarios, showcasing its potential for real‑world deployment. Additionally, the proposed framework's ability to generalize across different conditions highlights its robustness and reliability, making it a promising solution for practical ASV applications.

Abstract:
This paper summarises the Competition on Presentation Attack Detection on ID Cards (PAD‑IDCard) held at the 2024 International Joint Conference on Biometrics (IJCB2024). The competition attracted a total of ten registered teams, both from academia and industry. In the end, the participating teams submitted five valid submissions, with eight models to be evaluated by the organisers. The competition presented an independent assessment of current state‑of‑the‑art algorithms. Today, no independent evaluation on cross‑dataset is available; therefore, this work determined the state‑of‑the‑art on ID cards. To reach this goal, a sequestered test set and baseline algorithms were used to evaluate and compare all the proposals. The sequestered test dataset contains ID cards from four different countries. In summary, a team that chose to be "Anonymous" reached the best average ranking results of 74.80%, followed very closely by the "IDVC" team with 77.65%.