arXiv Papers with Code in Machine Learning (January 2026 - June 2026)

PaperId: 1, https://arxiv.org/pdf/2606.06481.pdf   GitHub GitHub
Authors:Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen
Title: Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection
Abstract:
As AI writing assistants become increasingly integrated into real‑world drafting and revision workflows, many documents are no longer purely human‑written or AI‑generated, but instead result from progressive human‑AI co‑editing. However, existing AI‑text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI‑Bench, an operation‑guided benchmark for studying progressive human‑to‑AI text transformation across document, sentence, token, and span granularities. Starting from human‑written documents, OpAI‑Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document‑level detectors, 7 sentence‑level detectors, and 2 fine‑grained token/span‑level detectors. Experiments reveal that AI‑text detectability is governed not only by the proportion of AI‑edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed‑authorship intermediate versions are often harder to detect than both fully human and heavily AI‑edited endpoints, exposing non‑monotonic detection patterns missed by existing benchmarks. OpAI‑Bench provides a controlled testbed for analyzing whether, when, and how AI‑assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA‑Lab/OpAI‑Bench.

Authors:Senmiao Wang, Tiantian Fang, Haoran Zhang, Yushun Zhang, Kunxiang Zhao, Alex Schwing, Ruoyu Sun
Title: PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
Abstract:
We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular‑value spectrum of weight matrices via low‑degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama‑1B pre‑training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum‑control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath‑aln/PC‑layer.

Authors:Shuo Wang, Xiangyu Wang, Quanxin Wang, Bailin Wu, Bokui Wang, Shunyang Huang, Boyan Deng, Haonan Liu, Ruiyi Fang, Zhenxiang Xu, Boyu Wang, Zhao Kang
Title: The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning
Abstract:
Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces systematic bias: it obscures geometry‑dependent performance variations and can lead to misleading conclusions about model generalization. In this work, we identify intrinsic geometry as a key latent factor governing model effectiveness. We demonstrate that conventional aggregated metrics mask critical performance trade‑offs that only become visible when datasets are stratified by their geometric properties. To address this issue, we introduce a curvature‑stratified evaluation framework that partitions datasets into positive, negative, and near‑zero curvature regimes. Our benchmark evaluates 18 representative models including Graph Convolutional Networks (GCNs), Graph Foundation Models (GFMs), and tabular learning methods across 14 datasets. We find that model rankings are highly stable within each curvature regime but shift significantly across regimes, indicating that performance is fundamentally geometry‑dependent rather than universally transferable. Notably, we identify regimes where GFMs offer diminishing returns compared to geometry‑aligned GNNs. Based on these findings, we propose a geometry‑aware evaluation protocol that yields more reliable and interpretable comparisons than standard aggregated benchmarks. We release all code, curvature‑stratified dataset splits, and evaluation tools to support reproducible and rigorous assessment of future relational learning methods. Code and datasets are provided in our project homepage: https://sirbabbage.github.io/CurvBench_HOME/.

Authors:Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi
Title: Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving
Abstract:
Multi‑turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key‑Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non‑uniform KV compression effectively preserves more information by considering the individual importance of each KV cache. However, such KV cache heterogeneity introduces various systemic challenges ‑ including memory fragmentation, scheduling complexities, and diminished kernel utilization ‑ which collectively lead to significant inefficiencies in existing LLM serving systems. To overcome these challenges, we present Tangram, a novel serving system designed to make Non‑uniform KV caches practical. Tangram addresses systemic inefficiencies through three core techniques: (1) Deterministic Budget Allocation assigns a static memory footprint to each head based on its intrinsic pattern, entirely eliminating dynamic scheduling overhead and prefill stalls; (2) Head Group Page clusters attention heads with similar retention demands and manages them with independent, vectorized page tables, thereby maximizing physical memory reclamation; and (3) Ahead‑of‑Time (AOT) Load Balancing leverages static budget profiles to ensure uniform GPU utilization without runtime overhead. Experimental results show that Tangram improves throughput by up to 2.6x compared to existing baselines, while fully preserving model accuracy. Our implementation is publicly available at https://github.com/aiha‑lab/TANGRAM.

Authors:Nanxi Chen, Chuanjie Cui, Airong Chen, Sifan Wang, Rujin Ma
Title: On the training of physics-informed neural operators for solving parametric partial differential equations
Abstract:
Physics‑informed neural operators (PINOs) aim to learn solution operators for partial differential equations by using the governing physics as supervision, rather than relying solely on paired input‑output simulation data. By incorporating physical constraints into the training objective, PINOs combine the cross‑instance generalization of neural operators with the data efficiency of physics‑informed learning. Despite this promise, how to train PINOs efficiently and robustly remains less well‑understood than the training of either data‑driven neural operators or physics‑informed neural networks (PINNs). To bridge this gap, we examine key components of the PINO training pipeline, including architecture design, optimizer choice, loss balancing, and collocation‑point sampling strategy. We study three representative operator backbones, Deep Operator Network (DeepONet), Fourier Neural Operator (FNO), and Continuous Vision Transformer (CViT), across five diverse parametric PDE systems. Our results show that CViT provides consistently strong and stable performance across the considered benchmarks. Beyond architecture, we find that several optimization pathologies previously identified in PINN training naturally arise in PINOs, including gradient conflicts and causal violation. We also find that mitigation algorithms developed for PINNs remain effective in the PINO setting. We further compare physics‑informed and data‑driven training under different data regimes, revealing that a carefully designed physics‑informed training pipeline can match, and in some cases, outperform purely data‑driven neural operators. Taken together, these findings provide a systematic empirical understanding of the optimization challenges in PINO training and inform a practical pipeline for efficient and robust physics‑informed operator learning. Code and data are available at https://github.com/NanxiiChen/PI‑CViT.

Authors:Tirtharaj Dash, Gunja Sachdeva
Title: $p$-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences
Abstract:
We introduce pVR, a topological machine learning framework for alignment‑free genomic sequence classification that combines p‑adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a p‑adic distance on k‑mer prefixes, which captures hierarchical positional structure, and a compositional L_1 distance on k‑mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi‑filtered Vietoris‑‑Rips complex, and per‑sequence topological summaries from this bi‑filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single p‑adic axis is topologically uninformative and why the bi‑filtration recovers nontrivial homology. On twelve genomic benchmarks (28 to 500 sequences, 3 to 7 classes), pVR outperforms four established alignment‑free baselines on three of six low‑sample datasets, with gains of up to 21 percentage points; it underperforms only on a SARS‑CoV‑2 variant benchmark whose point‑mutation divergence violates the hierarchical assumption, and all methods saturate in the large‑sample regime. pVR also outperforms zero‑shot frozen embeddings from the 500M‑parameter Nucleotide Transformer v2 by 6.7 to 11.4 percentage points on three low‑sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI‑Group/pVR.

Authors:Chen Hu, Rui Wang, Jiale Zhou, Jingjun Yi, Shaocheng Jin, Yidong Song, Yefeng Zheng
Title: A Sliced-Wasserstein Framework on Correlation Matrices for EEG Decoding
Abstract:
Electroencephalography (EEG) offers noninvasive, millisecond resolution recordings of neuronal activity and is widely used in neuroscience and healthcare. Many EEG decoding pipelines rely on covariance descriptors for their robustness to noise, but such representations are sensitive to channel‑wise scaling. Recent studies have therefore advocated full‑rank correlation matrices as a scale‑invariant alternative for EEG decoding. In this paper, we propose a general framework for Sliced Wasserstein (SW) discrepancies on manifolds endowed with Pullback Euclidean Metrics (PEMs), termed Pullback Euclidean Metric Sliced Wasserstein (PEMSW). Within this framework, we instantiate two Correlation Sliced‑Wasserstein (CorSW) discrepancies on the manifold of full‑rank correlation matrices under two recently introduced correlation geometries, i.e., the Off‑Log Metric (OLM) and Log‑Scaled Metric (LSM). Building on CorSW, we further develop a domain generalization (DG) framework for EEG decoding. Experiments on three EEG datasets demonstrate improved generalization under distribution shifts, with low training overhead and no additional inference cost. The source code is available at https://github.com/ChenHu‑ML/CorSW.

Authors:Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Title: OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation
Abstract:
Policy‑gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best‑of‑K discovery. We introduce OrderGrad, a family of likelihood‑ratio and reparameterization gradient estimators for order‑statistic objectives. OrderGrad optimizes finite‑sample L‑statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top‑m/best‑of‑K criteria by changing only the rank weights. For any fixed sample size and rank‑weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order‑statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy‑gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post‑training and other tasks. OrderGrad provides a unified, plug‑and‑play route to risk‑averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad

Authors:Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman
Title: RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit
Abstract:
Community‑conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit‑based, graph‑structural, semantic, hybrid, and interaction‑based), trains a parameter‑efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well‑being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade‑off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.

Authors:Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
Title: OPRD: On-Policy Representation Distillation
Abstract:
On‑policy distillation (OPD) supervises the student only in output space by matching next‑token probabilities. This output‑only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black‑box, discarding all intermediate hidden states after the LM head. We propose On‑Policy Representation Distillation (OPRD), which lifts distillation into hidden‑state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per‑layer structural information. Empirically, OPRD closes the student‑teacher gap on AIME 2024/2025 and AIMO, while output‑space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top‑k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.

Authors:Yoshiyuki Ootani
Title: Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder
Abstract:
Aggressive distillation of the diffusion U‑Net inverts the per‑frame bottleneck of real‑time text‑to‑image pipelines: once the denoiser is a 4‑step or 1‑step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision‑aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U‑Net paired with a 2.13B MLLM text encoder (Qwen3‑VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side‑stream / main‑stream CUDA pipelining with batched text‑encoder amortisation (and optional static‑prompt caching), a compile‑friendly ControlNet‑LLLite reformulation that folds the entire U‑Net + adapter stack into a single fused graph, and a periodic conditioning‑refresh schedule with a hook subset that amortises the per‑frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480‑frame run at batch size B=8 and 29.6 fps at B=16, with end‑to‑end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video‑rate streaming throughput rather than interactive low latency, and locate our numbers against same‑stack StreamDiffusion re‑runs as systems context, not as a benchmark superiority claim. For the trained oil‑painting style, the released temporal adapter generalises within in‑clip noise to 19 unused DAVIS‑2017 sequences and 15 non‑DAVIS clips from seven sources; prompt‑level generalisation to unseen style families is bounded and reported separately.

Authors:Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia
Title: Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Abstract:
AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground‑truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self‑supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re‑solves them in parallel. The agent analyzes these rollouts using self‑validation and self‑consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self‑preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE‑Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long‑horizon sessions.

Authors:Kerod Woldesenbet, Abem Woldesenbet
Title: T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction
Abstract:
We present T‑SAR‑JEPA, a self‑supervised framework for temporal anomaly detection in SAR amplitude stacks via latent prediction. A ViT‑Base/16 encoder from SAR‑JEPA is domain‑adapted on 39,300 Capella patches using local masked reconstruction with gradient feature prediction. A temporal transformer with sinusoidal time encoding forecasts future latent states from K=7 acquisitions, with progressive unfreezing substantially reducing validation loss. The model operates on amplitude alone; InSAR coherence serves exclusively as independent pseudo‑ground‑truth. On the DFC 2026 dataset (300 time‑series, three AOIs), T‑SAR‑JEPA achieves ROC‑AUC of 77.0% on the Hawaii eruption window, outperforming RX, PaDiM, Linear AR, and LSTM baselines (~50%). Spatial coherence of 99.9% (p < 0.001, permutation test) confirms structured detections. Code: https://github.com/TerraLatent/t‑sar‑jepa

Authors:Hongye Xu, Bartosz Krawczyk
Title: Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss
Abstract:
Exemplar‑free class‑incremental learning (EFCIL) aims to acquire new classes over time without storing raw data. Historically, prototype rehearsal, which samples around stored class prototypes and mixes them with current‑task data, has been a popular strategy to reduce catastrophic forgetting. However, recent drift‑compensation methods that explicitly realign prototypes in the evolving feature space consistently outperform prototype‑based rehearsal, raising the question of whether rehearsal itself is fundamentally limited. We argue that the performance gap stems not from the idea of prototype rehearsal per se, but from how it is typically instantiated: existing approaches treat prototypes as isolated class summaries that ignore information from nearby enemy classes, and fail to correct the emerging class imbalance between a handful of synthetic old‑class samples and hundreds of real instances from newly introduced classes. Building on this hypothesis, we revisit prototype rehearsal and propose a manifold‑aware variant that restores its competitiveness in EFCIL. First, we introduce Constrained Expansive Over‑Sampling, which interpolates each old‑class prototype toward its nearest enemy features from new classes, generating boundary‑aware rehearsal samples that better follow the underlying data manifold while preserving inter‑class separation. Second, we design an Adaptive Class‑Balanced loss that performs time‑based class weighting, amplifying gradients from older prototypes when they are most informative and gradually annealing their influence as richer supervision from later tasks accumulates. Together, these components turn prototype rehearsal into a drift‑resilient, imbalance‑aware mechanism that closes, and often reverses, the gap to recent drift‑compensation methods, achieving state‑of‑the‑art performance across multiple EFCIL benchmarks.

Authors:Hongye Xu, Bartosz Krawczyk
Title: Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning
Abstract:
Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar‑free class‑incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making representation drift for old classes particularly harmful. Prototype‑based EFCIL is attractive for its efficiency, yet prototypes drift as the embedding space evolves; therefore, projection‑based drift compensation has become a popular remedy. We show, however, that existing one‑directional projections introduce systematic bias: they either retroactively distort the current feature geometry or align past classes only locally, leaving cycle inconsistencies that accumulate across tasks. We introduce BiCyc, a bidirectional projector alignment approach with a cycle‑consistency objective. BiCyc jointly optimizes two maps, old‑to‑new and new‑to‑old, with stop‑gradient gating so that transport and representation co‑evolve. Analytically, we show that the cycle loss contracts the singular spectrum toward unity in whitened space, and that improved transport of class means and covariances yields smaller perturbations of classification log‑odds, preserving old‑class decisions and mitigating catastrophic forgetting. Empirically, across standard EFCIL benchmarks, BiCyc substantially reduces forgetting and improves accuracy in from‑scratch settings, while remaining competitive in the pretrained fine‑grained regime.

Authors:Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee, Woojin Lee
Title: SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks
Abstract:
As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization‑based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens to the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate \emphslots, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the \emphslots. Based on these findings, we introduce the Vulnerable Slot Score (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position‑search mechanism that is attack‑agnostic and can be plugged into any optimization‑based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves 14% higher Attack Success Rates (ASR) over GCG‑based attacks, converges faster, and shows superior robustness against defense methods with 42% higher ASR than baseline approaches. Our implementation is available at \hrefhttps://github.com/youai058/SlotGCGhttps://github.com/youai058/SlotGCG

Authors:Ayano Hiranaka, Ya-Chuan Hsu, Stefanos Nikolaidis, Erdem Bıyık, Daniel Seita
Title: Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization
Abstract:
AI assistants in human‑AI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering‑wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long‑term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action‑ or trajectory‑level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long‑horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero‑shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single‑misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long‑horizon task performance, successfully correcting 90% of student misconceptions. Code and project page are available at https://misoshiruseijin.github.io/SENSEI/.

Authors:Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari
Title: Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models
Abstract:
Despite the rapid progress of Vision‑Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human‑like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human‑grounded, bilingual (English‑Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image‑question‑answer tasks. Built with a semi‑automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state‑of‑the‑art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state‑of‑the‑art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross‑lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset is available at: https://github.com/qcri/Almieyar‑Oryx‑BloomBench.

Authors:Joong Ho Kim, Keith G. Mills
Title: Can We Predict The Human Preference For Text-to-Image Content Prior To Generation And Is It Even Useful To Do So?
Abstract:
Diffusion Models (DM) have revolutionized text‑driven generation by enabling the synthesis of high‑quality, photorealistic visual content from user prompts. Whereas prior advances in visual generation such as VAEs and GANs were primarily evaluated on perceptual or visual similarity metrics such as FID PSNR, DM advances have fostered the development of more advanced Human Preference Metrics (HPM) that model and quantify human judgment as scalar values. However, DMs synthesize content using an inherently stochastic process where random noise seeds generation. The initial random noise directly affects the quality of generated outputs, both qualitatively and quantitatively. This influence is pronounced in smaller models for local deployment scenarios. Given this phenomenon, we first investigate to what extent we can predict scalar HPM scores prior to committing compute resources for generation. Further, we then investigate to what extent we can leverage such prediction to improve the quality of generated images, and also study which HPMs are best suited for this task. Our investigation reveals that not only is this possible, but that it is feasible to achieve negligible hardware overhead.

Authors:Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, Donald A. Adjeroh
Title: GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data
Abstract:
We investigate how to make small tabular foundation models effective for High‑Dimensional, Low‑Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph‑guided Ordering with Local Refinement (GO‑LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP‑path‑style surrogate. We propose GOTabPFN,which builds on GO‑LR, and a Neuro‑Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta‑features, yielding a compact representation that makes TabPFN‑style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.

Authors:Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Zhu Chenyu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, AndyZeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lv, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Qiu Shi, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Benjamin Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, Ren He, Zhenyu He, Qiao Jin, Lang Lang, Yuetai Li, Sylvia Liu, Lu Lu, Qing Lu, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Dawn Song
Title: Agents' Last Exam
Abstract:
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long‑horizon, economically valuable, real‑world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non‑physical industries defined with reference to ONET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP‑relevant impact.

Authors:Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He
Title: Harnessing Generalist Agents for Contextualized Time Series
Abstract:
Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real‑world practitioners often require end‑to‑end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series‑native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience‑driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open‑ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real‑world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA‑iSAIL‑Lab‑UIUC/TimeClaw.

Authors:Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu
Title: LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization
Abstract:
Long‑horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi‑agent harness for reliable research‑level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural‑language proof graph, and shared system of record. Four contract‑scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two‑stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI‑gated rounds. LeanMarathon turns one brittle multi‑hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co‑mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.

Authors:Yaobo Zhang
Title: PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention
Abstract:
We unify RoPE's Fourier phase, Jordan‑RoPE's finite jets, and ALiBi's affine recency into a single learnable relative‑position space, and study which regions of this space are selected by different tasks. PJ‑RoPE is a Fourier‑Jet‑Affine formulation for relative attention, with an optional Poincare‑type reading as the affine completion of a homogeneous Fourier‑jet positional representation. Algebraically, the same primitives form a finite constant‑coefficient difference module: simple roots of the lag‑shift operator give Fourier/RoPE characters, repeated nonzero roots give Jordan/Fourier jets, and the repeated unit root gives ALiBi‑like affine recency. The framework separates scalar PJ‑bias kernels from exact PJ‑rotary feature transforms, introduces adaptive sector diagnostics, and uses LC/rapidity coordinates to stabilize high‑order jets. Controlled probes verify sector containment and selection; small language runs expose an affine/recency boundary; music‑token streams provide the clearest case where LC/affine variants remain strong while carrying measurable high‑order corrections; and LC diagnostics show a scale‑stability gain coupled to phase‑resolution loss.

Authors:Raghav Kansal, David Crair, Nghia Nguyen, Scott Pope, Bradley Parry
Title: Multimarginal flow matching with optimal transport potentials
Abstract:
Flow matching (FM) has emerged as a powerful framework for learning dynamic transport maps between two empirical distributions. However, less explored is the setting with intermediate observed marginals that can help constrain the flows between the endpoints. This "multimarginal" regime is central to modeling temporal evolution in dynamical systems in many scientific domains that can sample sequential distributions. We tackle this problem with a novel approach that leverages the connection between FM and dynamic optimal transport (OT), softly steering the flow towards the intermediate marginals through potential terms in the dynamic OT action. By extending the conditional FM learning target to incorporate these potentials, we derive an efficient, simulation‑free algorithm for multimarginal FM that offers considerable flexibility in the spatiotemporal dynamics of the learned flows. We demonstrate state‑of‑the‑art performance and training efficiency of OT‑potential FM (OTP‑FM) on diverse single‑cell RNA sequencing, oceanographic, and meteorological datasets. Our code is available at https://github.com/Bexorg‑Inc/OTP‑FM.

Authors:Dae Yon Hwang, Raunaq Suri, Valentin Villecroze, Anthony L. Caterini, Jesse C. Cresswell, Noël Vouitsis, Brendan Leigh Ross
Title: Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
Abstract:
LLM agents operate in two distinct regimes: open‑weight agents amenable to reinforcement learning (RL) and black‑box agents whose behaviour must be controlled purely at test time. Although black‑box agents are often backed by state‑of‑the‑art proprietary LLMs, API‑only access precludes parameter‑level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black‑box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black‑box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black‑box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test‑time compute of our method. AMC demonstrates the feasibility of performing principled RL‑style optimization of black‑box LLM agents. Code is available at https://github.com/layer6ai‑labs/Agentic‑Monte‑Carlo

Authors:Nadav Benedek, Ariel Shamir, Ohad Fried
Title: NIV: Neural Axis Variations for Variable Font Generation
Abstract:
Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor‑intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per‑point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi‑axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high‑complexity CJK glyphs, and even out‑of‑distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at https://github.com/ndvbd/NIV. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.

Authors:Hongfan Gao, Wangmeng Shen, Bin Yang, Jilin Hu
Title: HyFAD: Hybrid Time-Frequency Diffusion with Frequency-Aware Embedding for Time Series Imputation
Abstract:
Diffusion models have demonstrated strong performance in time series modeling due to their ability to progressively capture complex data distributions through iterative denoising. However, existing approaches struggle with frequency‑sensitive denoising, high‑frequency reconstruction and balancing global trends with local dynamics. To address these limitations, we propose HyFAD, a Hybrid time‑frequency Diffusion model with Frequency‑Aware embedding for time series imputation. Built upon the DDPM paradigm, HyFAD adopts a coupled time‑frequency diffusion framework, in which the reverse denoising proceeds sequentially from the time domain to the frequency domain, enabling coarse‑to‑fine generation. Specifically, the time‑domain diffusion process captures low‑frequency global trends, while the frequency‑domain diffusion process refines high‑frequency spectral components. We further introduce a frequency‑aware step embedding that exploits the relationship between diffusion steps and spectral components, providing step‑dependent spectral guidance and facilitates more accurate band‑wise reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that HyFAD achieves state‑of‑the‑art performance. Our source code is available at https://github.com/hongfangao/HyFAD.

Authors:Dong Liu, Yanxuan Yu, Ben Lengerich, Tony Geng, Ying Nian Wu
Title: OLIVE: Online Low-Rank Incremental Learning for Efficient Adaptive Exoskeletons
Abstract:
Wearable exoskeleton systems hold promise for restoring mobility in individuals with physical impairments, yet most existing controllers rely on static gait policies that lack the ability to adapt to dynamic real‑world environments or individual user characteristics. We present \olive (\underlineOnline \underlineLow‑rank \underlineIncremental Learning for Efficient Adapti\underlineve Exoskeletons), a parameter‑efficient online adaptation framework that continuously personalizes exoskeleton control during deployment. \olive decomposes the adaptive component of the control policy into a low‑rank residual form~\dW = \At\Bt^\top with rank~r!\ll!\min(d,k), reducing online update cost from \mathcalO(dk) to \mathcalO(r(d+k)) while preserving the stability of a pretrained base controller~\Wz. Parameters are updated via a reward‑shaped policy gradient driven purely by on‑body sensor feedback (EMG, IMU, vibration), eliminating dependence on offline reference trajectories. A gating mechanism modulates the strength of personalization based on contextual state, and a dynamic rank scheduler adapts the update dimensionality to terrain complexity ‑‑ allocating minimal capacity on simple flat terrain and expanding to higher‑rank updates on demanding uneven surfaces ‑‑ enabling robust performance across diverse activities: flat walking, stair navigation, slopes, and uneven terrain. Experiments on the wearable platform demonstrate that \olive achieves +13, +22, and +15 percentage‑point improvements in gait smoothness, effort reduction, and motion stability over the strongest baseline, converging within ~1,800 walking steps at 7.4,ms end‑to‑end latency. Our code implementation is available at https://github.com/FastLM/OLIVE.

Authors:Federico J. Gonzalez
Title: PyCC.id: A package for hypothesis-driven equation discovery with structural identifiability
Abstract:
Data‑driven equation discovery is fundamentally an inverse problem that seeks to infer the governing differential equations of a system directly from time‑series measurements. A known issue is the ill‑conditioned nature of the inverse problem, which frequently produces multiple mathematical models that fit the data similarly well. One path to address this issue is by incorporating known hypotheses and constraints into the training phase beforehand. While this approach effectively reduces the search space, it still results in multiple candidate models, forcing practitioners to rely on post‑hoc manual filtering based on their own domain expertise. A recent approach incorporates structural `skeletons' inspired by characteristic curves (CCs), defining a hypothesis‑driven methodology. In this methodology, practitioners define a skeleton, which is associated with a family of ordinary differential equations (ODEs), and then add their hypotheses and priors based on their domain knowledge to refine the obtained model iteratively. An important advantage of this approach is that some skeletons have demonstrable structural identifiability properties, which are useful for checking whether the skeleton is correct or should be discarded. Furthermore, this formalism enables the use of multiple equation discovery paradigms due to its modularity (such as neural networks, symbolic regression, and sparse regression). In this work, we present the Python library PyCC, which condenses these efforts into a flexible tool that allows researchers and engineers to seamlessly define their skeletons and hypotheses to discover ODEs from time‑dependent data.

Authors:Linyao Chen, Qinlao Zhao, Zechen Li, Mingming Li, Likun Ni, Jinyu Chen, Yuhao Yao, Xuan Song, Noboru Koshizuka, Hiroki Kobayashi
Title: Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent
Abstract:
Individual‑level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task‑specific training and offer limited decision‑level transparency. Recent LLM‑based methods improve interpretability, yet mostly rely on static prompts and single‑pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose \method, a training‑free LLM‑driven agent framework that formulates next‑location prediction as adaptive evidence‑controlled decision making. \method resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay‑move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training‑free LLM‑based methods, with GPT‑5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non‑fast‑path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same‑tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown‑zoo/AgentMob.

Authors:Zhangchen Xu, Junda Chen, Yue Huang, Dongfu Jiang, Jiefeng Chen, Hang Hua, Zijian Wu, Zheyuan Liu, Zexue He, Lichi Li, Shizhe Diao, Jiaxin Pei, Jinsung Yoon, Hao Zhang, Mengdi Wang, Radha Poovendran, Misha Sra, Alex Pentland, Zichen Chen
Title: AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
Abstract:
Scientific and engineering progress is fundamentally a long‑horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single‑turn responses or short‑horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long‑horizon closed‑loop optimization. AutoLab consists of 36 realistic, expert‑curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall‑clock budget. Evaluating 17 state‑of‑the‑art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude‑opus‑4.6 exhibits strong long‑horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open‑source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long‑horizon agents.

Authors:Marc Walden, Jason Liu, Shaashwath Sivakumar, Ryan Liu, Hamza Khan
Title: Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling
Abstract:
We investigate multi‑agent deep reinforcement learning and propose two enhancements to the Multi‑Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents' intended actions, thereby improving the accuracy and stability of its own policy. Second, we apply an importance sampling strategy, using geometric distribution, in the replay buffer to prioritize more recent and informative experiences, which helps mitigate the non‑stationarity inherent in multi‑agent environments. We evaluate both modifications on the discrete‑action Predator‑Prey task provided by the PettingZoo library, a flexible Python interface for general multi‑agent reinforcement learning benchmarks. Our results indicate that Action Inference is effective in improving learning stability and inter‑agent cooperation and that importance sampling using geometric distribution can lead to significant improvements in exploration efficiency over standard MADDPG. Code available at https://github.com/shaashwathsivakumar/MARL_Proj

Authors:Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, T. Konstantin Rusch, Shiwei Liu
Title: AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization
Abstract:
Mixture‑of‑Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory‑bound because all expert weights must reside in memory. Mixed‑precision quantization can substantially reduce this footprint by assigning different bit‑widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original training data, and hence the true training distribution, is proprietary and inaccessible. As a result, calibration sets are inevitably imperfect surrogates, and this can misestimate expert utilization and lead to suboptimal bit allocation. Motivated by the substantial cross‑expert quality variability observed in modern MoE models, and by the success of Heavy‑Tailed Self‑Regularization (HT‑SR) theory at predicting neural network model quality without access to training or testing data, we propose AlphaQ, a calibration‑free bit‑allocation method for MoE quantization. AlphaQ draws on HT‑SR theory and follows a simple principle: experts with more heavy‑tailed weight spectra are typically better trained and hence should receive higher bit‑widths, while experts with weaker heavy‑tailed structure can be quantized more aggressively. AlphaQ operationalizes this principle by measuring expert‑wise spectral heavy‑tailedness and solving a budget‑constrained optimization problem that minimizes total quantization error under a global bit‑budget constraint. Across several MoE models, AlphaQ consistently outperforms calibration‑based baselines under matched bit budgets. Notably, on Qwen1.5‑MoE, AlphaQ achieves near full‑precision accuracy with an average expert precision of only 3.5 bits, while delivering more than 4× memory compression. Our code is available at https://github.com/Superone77/AlphaQ.

Authors:Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath, Yiwei Lu
Title: Sequential Data Poisoning in LLM Post-Training
Abstract:
LLM post‑training proceeds through multiple stages, e.g., supervised fine‑tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post‑training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single‑attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT \to DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT \to PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post‑training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at https://github.com/jcksanderson/sequential‑poisoning.

Authors:Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang
Title: Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
Abstract:
Rubric‑based reinforcement learning (RL) uses an LLM‑as‑a‑Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real‑world rubric‑based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric‑based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric‑based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent‑based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS‑Lab/CHERRL.

Authors:Tran Dinh Tien, Zhiqiang Shen
Title: Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models
Abstract:
Current prompt‑based and adapter‑based tuning of vision‑language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground‑truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited‑supervision settings. We propose Omni‑Geometry Knowledge Distillation (OGKD), a new framework that injects class‑relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter‑class geometry. Using these targets, we develop two distillation losses: Global Geometry‑Aware Distillation (GAD) operates on the global image token, and Label‑Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine‑grained alignment. Across comprehensive experiments and analyses on 11 widely‑used medical datasets for base‑to‑novel and few‑shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%‑2.8% over all prior state‑of‑the‑art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.

Authors:Chong Zhang, Xiang Li, Jia Wang, Qiufeng Wang, Xiaobo Jin
Title: Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms
Abstract:
The robustness of deep neural networks is crucial for safety‑critical deployments, yet existing evaluation methods are often attack‑dependent and lack interpretability. We propose a principled, attack‑agnostic robustness metric based on the spectral norm of the Fisher Information Matrix (FIM), which quantifies the worst‑case sensitivity of the model's output distribution to input perturbations. Theoretically, we establish that the FIM equals the variance of the input Jacobian and derive closed‑form spectral bounds for common architectures, including VGG, ResNet, DenseNet, and Transformer, providing the first theoretical robustness ranking. To enable scalable evaluation, we develop efficient algorithms, including power iteration and Hutchinson‑based estimation, that support both white‑box and black‑box settings. Extensive experiments across multiple datasets, including CIFAR, ImageNet, and medical images, and across multiple architectures show a strong correlation between our metric and adversarial vulnerability. Our framework serves as an interpretable diagnostic tool that complements attack‑based evaluations, offering insights into architectural sensitivity and guiding the design of more robust models. Code is available at: https://github.com/franz‑chang/SRP/.

Authors:Ossi Lehtinen
Title: An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers
Abstract:
Transformers consuming multi‑channel scalar signals must embed C simultaneous values into one d_\textmodel‑dimensional vector per time step. We empirically audit eight input encoders ‑‑ spanning a shared‑scalar baseline, per‑channel linear projections, an orthogonality regulariser, a nonlinear MLP stem, block‑partitioned concatenation, channel‑independent and channel‑as‑token architectures, and a projected positional encoding ‑‑ on a synthetic benchmark designed to make channel identity informative and on ETTh1 as a real‑data check, measured in next‑step negative log‑likelihood (NLL). The headline is one of practical near‑equivalence within a wide "top tier": the standard per‑channel linear projection (nn.Linear(C, d_\textmodel)) matches every alternative in that tier up to small, statistically real but practically modest, differences. Two encoders lose decisively: the shared‑scalar baseline, which collapses for information‑theoretic reasons we make explicit, and the channel‑independent PatchTST‑spirit baseline, which underperforms on both benchmarks and overfits universally on the synthetic one. Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer edges the rest at small C, with a direct geometric probe showing the mechanism is positional‑channel orthogonalisation; a nonlinear MLP stem edges them at the largest C we test, with the gap shrinking under more training data. The practical recommendation is to use nn.Linear(C, d_\textmodel) by default and reach for something more elaborate only when the task at hand gives a real reason to do so. Code and data to reproduce every experiment in this paper are available at https://github.com/OssiLehtinen/channel‑encoder‑audit

Authors:Jannik Presberger, Alexander Männel, Maynard Koch, Thomas C. Schmidt, Matthias Wählisch, Bjoern Andres
Title: Contrastive Learning and Correlation Clustering for Sequences of Network Telescope Data
Abstract:
Understanding activities of Internet scanners is challenging; it often requires identifying relationships between sources, a task for which semantic annotations are scarce. This work investigates whether semantically meaningful pairwise relationships between sequences of network flow records can be estimated by contrastive learning, without pretraining and without annotations. To this end, we propose a transformer model that embeds minimally preprocessed sequences of network flow records and train it using contrastive learning. With the similarities obtained from this model, we state a correlation clustering problem and solve it locally. Experimentally, we show: Learned similarities are higher on average for sequences originating from the same source than for sequences originating from different sources, and this property generalizes to unseen sequences of unseen sources. Moreover, correlation clustering yields clusters consistent with scanner labels. The complete source code of the algorithms and for reproducing the experiments is publicly available.

Authors:Alessandro Gambetti, Qiwei Han, Cláudia Soares, Hong Shen
Title: Beyond Symmetric Alignment: Spectral Diagnostics of Modality Imbalance in Vision-Language Models in the Medical Domain
Abstract:
Vision‑Language Models (VLMs) struggle when applied to medical image‑text data, yet the tools available to diagnose this failure remain limited. Existing representation alignment metrics are symmetric, collapsing both modalities into a single score and hiding which modality drives cross‑modal degradation. We introduce the Spectral Alignment Score (SAS), an asymmetric metric that projects both modalities onto the principal eigenbasis of an anchor modality and computes eigenvalue‑weighted per‑eigenmode correlations, resulting in directional scores whose difference quantifies modality information imbalance. We embed SAS within a benchmarking framework evaluating 15 VLMs across natural and medical image‑text datasets alongside 6 alignment metrics and bidirectional retrieval. Our experiments show that medical images retain richer structural information than their paired clinical reports, a directional asymmetry invisible to all competing metrics, and that SAS achieves the strongest zero‑label correlation with retrieval performance in the medical domain, positioning it as a practical diagnostic tool for clinical deployment. Code is available at this URL: https://github.com/iamalegambetti/medical‑vlms‑assessment.

Authors:Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa
Title: SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
Abstract:
Sparse attention reduces compute and memory bandwidth for long‑context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains O(T^2) complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per‑layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU‑to‑GPU prefetch with current‑layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi‑head selector. SparDA adds <0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse‑pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25× prefill speedup and 1.7× decode speedup over the sparse‑attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3× higher decode throughput than the non‑offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.

Authors:Yuanrui Wang, Xingxuan Zhang, Han Yu, Mingchao Hao, Gang Ren, Hao Yuan, Li Mao, Yunjia Zhang, Chun Yuan, Peng Cui
Title: LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models
Abstract:
Tabular foundation models (TFMs) increasingly rival tree ensembles, but their performance is often compute‑inefficient: with standard affine scalar tokenization, each feature injects value variation through an essentially one‑dimensional channel, and feature IDs/positional signals cannot increase within‑feature value degrees of freedom, yielding weak early‑layer value sensitivity and redundant hidden states. We present a unified tokenize‑and‑route framework for strong TFMs: RaBEL expands each scalar into compact localized RBF features (optionally exponent‑gated) to improve conditioning and shallow‑layer effective rank, while a reordered bidirectional block S‑>N‑>F aligns computation with the readout by aggregating cross‑sample context before feature mixing and using attention pooling. Together, these changes yield LimiX‑2M, a 2M‑parameter model that outperforms larger TabPFN‑v2 and TabICL baselines on widely used tabular benchmarks while reducing training and inference costs. These results highlight value‑aware tokenization and readout‑aligned routing as key levers for improving the accuracy‑‑efficiency trade‑off in TFMs. Model checkpoints and inference code are available at https://github.com/limix‑ldm‑ai/LimiX.

Authors:Luoyidi Zhou
Title: An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization
Abstract:
Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one‑dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR‑10 and CIFAR‑100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/zlyd‑CV/DeepLearning‑Empirical‑Studies.

Authors:Muhammad Hadi, Muhammad Jahangir, Talha Shafique, Muhammad Khuram Shahzad
Title: TITAN-FedAnil+: Trust-Based Adaptive Blockchain Federated Learning for Resource-Constrained Intelligent Enterprises
Abstract:
Federated Learning (FL) has emerged as an effective paradigm for collaborative intelligence while preserving data privacy. However, data heterogeneity arising from non‑IID distributions and decentralized security threats remain significant challenges, particularly in resource‑constrained enterprise environments. This paper presents TITAN‑FedAnil+, a Trust‑Based Adaptive Network for blockchain‑enabled federated learning in intelligent enterprises. The proposed framework introduces affinity propagation‑based adaptive clustered aggregation to identify and filter malicious updates without requiring prior knowledge of the number of attackers. In addition, GPU‑accelerated vectorization is employed to improve computational efficiency, while a signed state jump mechanism enables lightweight blockchain resynchronization. Experimental results demonstrate substantial reductions in memory overhead, achieving up to 81% savings across 50 communication rounds on constrained 8 GB edge devices compared with the baseline framework. The results indicate that TITAN‑FedAnil+ effectively improves robustness, scalability, and resource efficiency for secure federated learning deployments in intelligent enterprise environments.

Authors:Chen Chu, Bita Azarijoo, Li Xiong, Khurram Shafique, Cyrus Shahabi
Title: From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models
Abstract:
Recent large language models (LLMs) often appear to exhibit spatial reasoning ability; however, this capability is largely \emphsymbolic, arising from pattern matching over spatial language rather than true \emphgeometric reasoning over space. Because LLMs operate on discrete tokens, they lack native support for continuous spatial representations, explicit geometric computation, and structured spatial operators. To address this limitation, we introduce the \emphSpatial Language Model (SLM), the first multimodal LLM that treats location information as a first‑class modality and enables geometric spatial reasoning within the model's inference process. SLM directly operates on learned spatial representations rather than textual descriptions of spatial relations. To support effective training, we construct a \emphSpatial Instruction Dataset that aligns spatial representations, atomic geometric operations, and natural language instructions. We further propose a new benchmark named \emphSpatialEval, which is designed to evaluate spatial reasoning across attributes, distance, topology, and relative‑position tasks. Extensive experiments show that SLM significantly outperforms existing LLM‑based approaches that rely on symbolic reasoning via prompt engineering or textual abstraction, demonstrating the benefits of integrating geometric spatial representations for robust spatial reasoning. Our instruction dataset, evaluation benchmark, model training codes, and models' checkpoints can be found at: \hyperlinkhttps://github.com/chuchen2017/SLMhttps://github.com/chuchen2017/SLM.

Authors:Yanshun Zhao, Xiaoyu Peng, Jiamin Jiang, Congcong Zhu, Jingrun Chen
Title: MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers
Abstract:
Conventional patchified Transformers operate on uniform spatial partitions, distributing computational effort evenly across the domain irrespective of local features. This inflexible tokenization scheme is inherently limited in its ability to efficiently represent and process solutions to complex PDEs. To address this, we propose MeshTok, an adaptive mesh refinement (AMR)‑inspired tokenization and sequence modeling framework. This method selectively refines spatial regions exhibiting sharp gradients, transient features, or multiscale structures, generating a heterogeneous set of multiscale tokens defined on a fixed simulation grid. These tokens are processed within a unified Transformer sequence, enabling the model to simultaneously capture coarse‑grained global context and fine‑grained local details without requiring specialized architectural components. Although adaptive refinement moderately increases token count, it promotes a more targeted allocation of computational resources to physically informative regions, which we view as a practical inductive bias rather than a formal optimality guarantee. Experimental evaluations across multiple PDE families and benchmark datasets demonstrate that MeshTok consistently improves the efficiency‑accuracy trade‑off compared to uniform‑grid baselines. This suggests adaptive multiscale tokenization as a scalable and generalizable design principle for neural PDE modeling. Code is available at https://github.com/SCAILab‑USTC/MeshTok.

Authors:Cathy Liu
Title: Literature-Guided Minimax Optimization of Virtual Epilepsy Neurostimulation
Abstract:
Computational models of epilepsy promise patient‑specific treatment design, but most optimization workflows still search for parameters that perform well on average. In neuromodulation, this is a weak target: a protocol that improves the mean response can still fail in the patient whose network is least tolerant to stimulation. We present a literature‑guided minimax pipeline that couples PubMed‑scale hypothesis extraction, The Virtual Brain (TVB) Epileptor simulations, and large‑language‑model‑guided black‑box optimization. The optimizer proposes either intrinsic model‑control parameters or clinically interpretable external‑stimulation protocols; TVB evaluates each proposal across sampled virtual patients; and the objective maximizes worst‑case reward, defined as the negative variance of simulated seizure activity. In the intrinsic model‑control experiment, the best archived parameter set improved worst‑case reward from ‑0.5285 to ‑0.3182, a 39.8% gain over baseline. The clinical‑style external‑stimulation search produced a much smaller worst‑case improvement (1.7%), and a 20‑patient virtual cohort showed no aggregate benefit (p=0.9019), despite a 55% responder rate and a positive temporal‑lobe subgroup signal. The study should be read as an in silico proof of concept for robust, literature‑aware neurostimulation design, not as clinical evidence.

Authors:Julian Skirzynski, Harry Cheon, Shreyas Kadekodi, Meredith Stewart, Berk Ustun
Title: Measuring What Matters: Synthetic Benchmarks for Concept Bottleneck Models
Abstract:
Concept bottleneck models predict outcomes from high‑level concepts detected in inputs. Although concepts provide a simple way to reap benefits from interpretability, very few datasets include concept labels. This limits researchers' ability to determine which problems are suitable for these models, isolate the factors that drive their performance or lead to failures, or uncover which algorithms perform well. In this paper, we develop synthetic benchmarks for concept‑bottleneck models, focusing on their two main use cases: decision support, in which models assist humans in making better decisions, and automation, in which models handle routine tasks without supervision. Our benchmarks can generate labeled datasets while controlling for properties that affect performance, including data modality, concept choice, annotation quality, and completeness. We demonstrate how the benchmarks can be used to evaluate representative classes of concept bottleneck models. Our demonstrations show how the benchmarks can diagnose failure modes and guide follow‑up testing.

Authors:Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell
Title: Efficient and Training-Free Single-Image Diffusion Models
Abstract:
We consider the problem of generating images whose internal structure ‑‑ defined by the distribution of patches across multiple scales ‑‑ matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed‑form denoiser, eliminating the need for neural network training. We integrate this patch‑based denoiser into an efficient, training‑free image diffusion model, and we describe how our method connects to classical patch‑based image restoration techniques. Our approach achieves state‑of‑the‑art generation quality and diversity compared to trained single‑image diffusion models, and we demonstrate applications, including unconditional image generation, text‑guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single‑image generation in one second, and gigapixel generation in minutes.

Authors:Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri
Title: DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities
Abstract:
The growing popularity and capacity of generative models have eroded the distinction between human and machine‑generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open‑source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first‑of‑its‑kind, extensible toolkit designed to provide a unified interface for AI‑generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state‑of‑the‑art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self‑contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi‑modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open‑source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

Authors:Christian Lysenstøen
Title: Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval
Abstract:
Retrieving the few past turns that answer a new query across long multi‑session histories is the retrieval bottleneck behind long‑term conversational memory (LoCoMo, LongMemEval). Recent concurrent work, Nano‑Memory, shows that scoring a session by the maximum query‑turn similarity (late interaction, "Turn Isolation Retrieval") beats mean‑pooled session embeddings. We do not claim that effect; we replicate it and ask what a training‑free, CPU‑only retrieval stage should add around it. We report four findings. (1) Fuse: score‑level fusion of the late‑interaction dense score with BM25, under a single leave‑one‑conversation‑out weight, adds +8.8 to +17.2 points of LoCoMo Hit@1 over late interaction alone across six encoders (all p<1e‑4), reaching Hit@1 0.752 / NDCG@5 0.829 (e5‑large‑v2), +11.2 pp over BM25. (2) An off‑the‑shelf web‑search cross‑encoder reranker over the fused top‑10 hurts here, degrading Hit@1 by 6.9 pp (one reranker, one configuration). (3) A pooling‑operator ablation shows top‑k late interaction matches max‑similarity, but a naive smooth‑max (log‑sum‑exp) collapses for half the encoders. (4) The late‑minus‑early gap is large for all six encoders and tends to be larger for larger ones, while the marginal fusion gain shrinks; on LongMemEval‑S, a lexical regime where BM25 saturates, the net fusion gain over BM25 is small and not significant. A per‑category analysis frames the gain as a division of labor: dense late interaction helps most on multi‑hop and temporal questions but trails BM25 on adversarial ones. The contribution is a controlled, reproducible account of a strong training‑free retrieval recipe, not the late‑interaction retriever itself (Nano‑Memory's). We make no claim to a complete memory architecture; this is a retrieval‑stage study.

Authors:Youqi Wu, Mohammad Jalali, Farzan Farnia
Title: KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models
Abstract:
Vision‑language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not explain how their representations differ structurally. In this work, we study this problem through the task of Contrastive Embedding Clustering: identifying sample subsets that are weakly clustered under one representation but strongly clustered under another. We propose \emphKernel Optimization for Discrepancy Analysis (KODA), a kernel‑based framework for contrastive representation comparison and alignment. KODA constructs unified multimodal kernels through modality‑wise kernel composition and formulates discrepancy discovery as a constrained optimization problem that searches for coherent structures in one representation while suppressing coherence in a reference representation. This yields interpretable discrepancy directions associated with specific sample subsets and modality interactions. To scale KODA to large vision‑language datasets, we develop randomized low‑dimensional approximations of joint kernels using random projections, including Random Fourier Features for shift‑invariant kernels. Empirically, KODA identifies consistent and interpretable discrepancy structures across vision‑language representations and provides sample subsets for representation alignment. The code is available at https://github.com/yokiwuuu/KODA.

Authors:Michael J. Bommarito
Title: MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments
Abstract:
File‑type classification underlies many workflows like malware triage, forensic carving, packet inspection, and storage indexing. Learned systems such as Google's Magika assume whole‑file access at a known offset, so they break on the inputs many of these tasks actually produce, like a single packet payload, a header‑less carved fragment, a random disk block, or a chunked upload. We introduce MimeLens, a family of small BERT‑style encoders pretrained on binary content from windows sampled at a uniformly random offset within each file, with no privileged head‑of‑file position, in standard‑ and short‑context variants. A byte chunk goes in from anywhere in a file, no header needed and no fixed size; out comes one of libmagic's 125 MIME labels. On the clean head of complete files, MimeLens beats Magika v1.1 by +10.7 pp top‑1 on libmagic‑labeled data, and it keeps classifying where Magika cannot: from a single mid‑stream UDP packet, and more than twice as accurately as libmagic and Magika on random mid‑file disk blocks. The cost is latency: MimeLens runs roughly one to two orders of magnitude slower per sample on CPU than Magika, though it matches on consumer GPUs or in batch. All trained checkpoints are released on Hugging Face (mjbommar/mimelens‑001‑).

Authors:Mansoor Ahmed, Huirong Chai, Haoxin Wang, Hemanth Venkateswara, Murray Patterson
Title: EpiFormer: Learning Antigen-Antibody Interactions for Epitope Prediction via Geometric Deep Learning
Abstract:
Antibodies neutralize foreign antigens by binding to specific surface regions called epitopes. Computational epitope prediction is critical for understanding immune recognition and guiding antibody engineering. However, existing methods face three fundamental challenges: antibody‑aware models encode each chain independently and combine them only at a late stage, failing to capture co‑dependent structural features that define binding interfaces, whereas severe class imbalance and scarcity of known antibody‑antigen complexes render standard training objectives ineffective. We propose EpiFormer, a general encoder‑decoder framework that addresses these challenges jointly. Our key design principle is interleaved cross‑attention within GNN encoding layers, enabling bidirectional antigen‑antibody information flow throughout representation learning rather than only at the output. This early‑fusion principle is backbone‑agnostic, providing consistent gains across GNN architectures from simple GCNs to equivariant models. We further show that sparsity‑aware objectives are effective when paired with early‑fusion architectures for the epitope prediction task. EpiFormer improves over the previous best method by over 40% in F1 score on standard benchmarks, demonstrating generalizability and cross‑dataset transferability. Notably, EpiFormer discovers known biological principles as emergent behaviors of end‑to‑end training, where the learned cross‑attention gates favor antigen‑to‑antibody information flow, consistent with the asymmetric roles of the two chains at the binding interface, and the model's preference for geometric over evolutionary features aligns with the established finding that epitope residues are not evolutionarily conserved. The source code is available at: https://github.com/mansoor181/epiformer.git

Authors:Shiqiao Zhou, Holger Schöner, Zipeng Wu, Edouard Fouché, IAG Wilson, Shuo Wang
Title: Stationarity-Aware Retrieval-Augmented Time Series Forecasting
Abstract:
Time series forecasting relies on historical patterns, but real‑world series often exhibit non‑stationarity and regime shifts that challenge fully parametric forecasters. Inspired by Retrieval‑Augmented Generation (RAG), recent work augments forecasters by retrieving relevant historical segments and using them as external evidence at inference time. However, due to the intrinsic non‑stationarity of real‑world time series, a highly similar past segment does not necessarily imply a similar future, rendering similarity‑only retrieval brittle and prone to redundancy. We propose Stationarity‑Aware Retrieval‑Augmented Time Series Forecasting (SARAF), a framework that adaptively balances relevance and diversity in retrieval. SARAF first forms a candidate pool via temporal similarity with time‑aligned enhancement, then applies a diversity‑aware selection strategy to cover heterogeneous historical regimes, with the diversification strength automatically modulated by dataset‑level stationarity. Moreover, SARAF uses stationarity‑aware aggregation to fuse the retrieved futures. Extensive experiments on eight real‑world datasets show that SARAF achieves competitive forecasting performance and improves average accuracy and robustness over strong baselines, with particularly clear benefits under challenging non‑stationary settings. Code: https://github.com/ShiqiaoZhou/SARAF.

Authors:Pragya Sharma, Brian Wang, Mani Srivastava
Title: CADET: A Modular Platform for Evaluating Distributed Cooperative Autonomy in Connected Autonomous Vehicles
Abstract:
Deep learning models are increasingly central to autonomous vehicle (AV) pipelines, yet their integration has traditionally followed a monolithic design where perception, planning, and control execute on a single onboard computer. This design overlooks the emerging paradigm of cooperative autonomy, where vehicles interact with roadside units (RSUs), edge servers, and cloud‑hosted intelligence through vehicle‑to‑everything (V2X) connectivity. Cooperative perception and control improve safety and efficiency, but also introduce systems‑level challenges: network latency, compute heterogeneity, and multi‑tenant contention, all critically affect real‑time decision‑making. These challenges are further amplified by the increasing reliance on large foundation models, whose scale necessitates cloud deployment. We present CADET (Cooperative Autonomy through Distributed Experimentation Toolkit), a modular platform for systematic and reproducible evaluation of distributed cooperative autonomy systems under realistic deployment conditions. CADET decouples the AV stack into composable modules that can be flexibly deployed across vehicles, infrastructure, and edge/cloud tiers. The framework integrates state‑of‑the‑art models, incorporates trace‑driven network and workload emulation, and provides synchronized model‑, system‑, and task‑level instrumentation. Through V2V and V2I experiments, we show that distributed deployment choices fundamentally shape safety, with V2V intent packets outperforming cloud‑based perception and RSU‑assisted perception sustaining safety until overloaded by concurrent requests. Although designed for AV pipelines, CADET also supports dataset‑driven experimentation, enabling systems and ML researchers to benchmark distributed inference workloads independently of full vehicle simulation. CADET is open source, with code and demo available at https://nesl.github.io/cadet‑web.

Authors:Eduardo Terrés-Caballero, Herke van Hoof
Title: A Goal-Set Characterization of Task Composition in the Boolean Task Algebra
Abstract:
The Boolean Task Algebra (BTA) provides a principled framework for zero‑shot task composition in reinforcement learning by equipping goal‑reaching tasks with Boolean operations. We revisit its structural assumptions and formalize a collapse in the space of optimal extended Q‑value functions: in deterministic MDPs, every such function is fully determined by the universal and empty tasks. This makes the logarithmic set of base tasks proposed in the original BTA formulation redundant. Building on this observation, we introduce a goal‑set‑based composition method that performs logical operations on goal sets and reconstructs composed value functions by selecting slices from the universal and empty value functions. This reduces learning costs for standard BTA and reduces composition time for both BTA and Skill Machines, while preserving policy performance. Experiments across tabular, visual, function‑approximation, and continuous‑control domains show that learning additional base tasks does not yield better performance. Finally, we study the stochastic setting and provide a counterexample showing that this collapse need not hold, that is, optimal composition may require accounting for exponentially many policies in the number of goals. Code is available at https://github.com/EduardoTerres/bta_paper.

Authors:Liulu He, XuanAng Liu, Juntao Liu, Taolue Feng, Ting Lu, Chunsheng Gan, Zhiyv Peng, Yuan Du, Huanrui Yang, Yijiang Liu, Li Du
Title: LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
Abstract:
Existing quantization methods are fundamentally limited by rigid, integer‑based bit‑widths (e.g., 2, 3‑bit), resulting in a ``deployment gap" where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit‑width control for true Pareto‑optimal deployment. The core innovation is a ``lift‑then‑project" mechanism which approximates low‑dimensional weight vectors by projecting a simple 1‑bit lattice from a higher‑dimensional ``lifted" space. Crucially, the effective bit‑width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit‑width to be tuned quasi‑continuous as the dimension is a flexible structural parameter. This projection generates a structured yet non‑uniform codebook, capturing the expressive power of Vector Quantization (VQ). While beneficial over VQ, LiftQuant's decoding path relies solely on linear transformations and 1‑bit uniform quantizers, retaining hardware‑friendly nature. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state‑of‑the‑art 2‑bit models fitted on the same device. Our code and ckpt is available at https://github.com/Heliulu/LiftQuant.

Authors:Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou
Title: Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation
Abstract:
In embodied vision‑language decision making tasks such as robotic manipulation and navigation, Vision‑Language and Vision‑Language‑Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long‑term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task‑relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one‑step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse‑to‑fine focus plan generation method for VLMs leveraging their long‑term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub‑problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future‑item.github.io/SceneDiver.

Authors:Dat Thanh Tran, Van Khu Vu, Yining Ma
Title: Beyond Static Priors: Dynamic Neural Guidance for Large-Scale Ant Colony Optimization
Abstract:
Neural‑guided Ant Colony Optimization (ACO) suffers from a fundamental training‑inference misalignment: policies are typically trained to generate static priors (e.g., heatmaps), yet deployed to guide iterative, long‑horizon search processes. In this paper, we present DyNACO, a novel framework that achieves dynamic neural guidance by periodically observing the pheromone distribution and the incumbent solution. To make DyNACO tractable at scale, we pair the policy with a perturbation‑based ACO backend and a scope‑restricted refinement mechanism that jointly ensure efficacy and stable credit assignment. On TSP, DyNACO scales to 100,000‑node instances and outperforms neural baselines while often reducing total runtime compared to the unguided solver. We extend DyNACO to CVRP via a capacity‑aware backend, consistently improving the unguided baseline with less than 1% neural overhead. We further provide in‑depth analysis validating the model's generalization capabilities and elucidating why dynamic guidance outperforms static priors. Our work underscores the necessity of aligning neural training with iterative search dynamics in learning‑guided optimization. The code is available at https://github.com/shoraaa/DyNACO.

Authors:Thanh Luong Tuan, Abhijit Sanyal
Title: Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification
Abstract:
Pre‑deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post‑deployment monitoring, human‑in‑the‑loop controls, and prompt‑level guardrails offer limited assurance once an agent is operating in production. We present an ontology‑grounded verification framework ‑‑ to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology‑to‑scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine‑verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry‑by‑regulatory‑regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary‑source regulatory requirements and 25 injected faults. Ontology‑grounded generation significantly outperformed the dominant persona‑based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e‑6); transparently, its advantage over plain and retrieval‑augmented prompting did not survive Bonferroni correction. Cross‑validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona‑versus‑ontology pattern. The framework offers a reproducible, regulation‑grounded route to pre‑deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.

Authors:Yifeng Liu, Shiyuan Zhang, Yifan Zhang, Quanquan Gu
Title: Self-Distilled Policy Gradient
Abstract:
On‑policy self‑distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse‑reward reinforcement learning. Actually, it can be instantiated as an auxiliary full‑vocabulary student‑to‑teacher reverse Kullback‑Leibler divergence loss. We therefore propose SDPG, a self‑distilled policy‑gradient framework that combines group‑relative verifier advantages with normalized standard deviation, exact full‑vocabulary on‑policy self‑distillation, as well as reference‑policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self‑distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.

Authors:Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis
Title: Do Transformers Need Three Projections? Systematic Study of QKV Variants
Abstract:
Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q‑K=V (shared key‑value), b) Q=K‑V (shared query‑key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q‑K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q‑K=V with GQA‑4 yields 87.5% cache reduction, while Q‑K=V + MQA achieves 96.9%, enabling practical on‑device inference. We show that Q‑K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low‑rank regime, whereas Q=K‑V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip‑Inc/Do‑Transformers‑Need‑3‑Projections

Authors:Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman
Title: Neuron Populations Exhibit Divergent Selectivity with Scale
Abstract:
We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non‑Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power‑law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain‑specialized with scale and illustrate their selectivity through a targeted data‑filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron‑level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.

Authors:Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang
Title: Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
Abstract:
Reward models (RMs) provide critical feedback signals for LLM post‑training, notably in reinforced fine‑tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule‑based verifiers, ground‑truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill‑RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward‑Evaluation Skill. By treating reward computation as a structured agentic task, Skill‑RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best‑of‑N selection and reinforcement learning, demonstrate that Skill‑RM consistently outperforms traditional judge baselines. Our findings suggest that Skill‑RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen‑Applications/Skill‑RM.

Authors:Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo, Alexander Robey, Na Li, Yebin Wang, Changliu Liu
Title: VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring
Abstract:
As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount ‑‑ physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision‑Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real‑time safety interventions when dangerous actions are predicted. VLESA addresses intent‑dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal‑conditioned safety annotations is introduced, enabling a goal‑conditioned safety Q‑filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent‑action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV‑2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground‑truth frame compared to baselines, while the GRPO‑trained Q‑filter improves action safety by over 41 percentage points through goal‑conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.

Authors:Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu
Title: PointAction: 3D Points as Universal Action Representations for Robot Control
Abstract:
Video‑Action Models (VAMs) leverage the broad visual dynamics captured by pre‑trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB‑only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine‑grained spatial constraints under‑specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point‑based 4D modeling. PointAction fine‑tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task‑relevant scene geometry. These point dynamics serve as a structured, embodiment‑agnostic action interface, which a diffusion‑based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB‑only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state‑of‑the‑art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.

Authors:Dan Jacobellis, Neeraja J. Yadwadkar
Title: SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction
Abstract:
In robotics systems, vast amounts of visual data are easily captured at high resolution using low‑cost, low‑power hardware. Yet, limited bandwidth and on‑device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate‑distortion trade‑off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One‑Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision‑language‑based perception. Using SEAOTTER, we train both general‑purpose and task‑aware transcoding pipelines for a pre‑trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top‑1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT‑SysML/seaotter .

Authors:Niccolò Perrone, Fanny Lehmann, Stefania Fresca, Filippo Gatti
Title: Correcting Neural Operator Spectral Bias via Diffusion Posterior Sampling with Sparse Observations
Abstract:
Neural operator surrogates (NO) approximate PDE solutions orders of magnitude faster than numerical solvers, but suffer from spectral bias: high‑frequency content is systematically attenuated, limiting reliability where fine‑scale structure matters. Sparse sensor measurements of the field are often available too, offering pointwise accuracy without spectral distortion but covering only a small fraction of the domain. We address this by treating NO predictions as auxiliary observations in a diffusion posterior sampling framework. Our method, FreqNO‑DPS (https://github.com/niccoloperrone/FreqNO‑DPS), combines an unconditional score‑based diffusion prior, trained on high‑fidelity simulations, with diffusion posterior sampling (DPS) conditioned on sparse observations and guided by a frozen neural operator. Naive integration reintroduces the surrogate's spectral bias; we resolve this with a closed‑form, spectrally shaped guidance score that weights the surrogate by its frequency‑dependent accuracy and needs no denoiser backpropagation. A distribution‑free analysis bounds the approximation error across the frequency‑diffusion‑time plane and shows the guidance's frequency dependence is preserved regardless of distributional assumptions. On 3D elastic wavefield prediction at 5% and 2% sensor coverage, the method reaches near‑zero spectral bias across all bands, where both the surrogate and sensor‑only DPS show systematic high‑frequency attenuation. Isotropic guidance, the natural baseline, improves pointwise accuracy but carries the bias into the posterior nearly intact, confirming that frequency‑dependent calibration is essential, not merely beneficial. The framework needs only paired surrogate/reference data and exploits no problem‑specific structure beyond the residual's approximate spectral diagonality, verifiable for new surrogates via the coherence diagnostic we provide.

Authors:Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia
Title: Value-Aware Stochastic KV Cache Eviction for Reasoning Models
Abstract:
Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key‑value pairs from the cache, yet they often yield worse accuracy than selection‑based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value‑aware Stochastic KV Cache Eviction (VaSE), a training‑free recipe that protects large‑magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.

Authors:Thomas Maillart, Thibaut Chataing, David Dosu, Paul Bagourd, Julian Jang-Jaccard, Alain Mermoud
Title: Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing
Abstract:
Understanding and anticipating scientific change requires models that distinguish between endogenous consolidation and exogenous diffusion of scientific concepts. Using the quantum computing subtree of concepts in OpenAlex, we construct a temporally resolved concept co‑occurrence network and track each concept pair through its upstream citation lineage and downstream diffusion. We train LightGBM models on distributional and diversity‑aware features to predict four outcomes: endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. After controlling for overall publication growth of the scientific body, endogenous reinforcement proves largely unpredictable in the primary quantum‑computing benchmark. In contrast, exogenous diffusion and entropy are strongly predictable (R^2 up to 0.78à) and are driven by upstream heterogeneity, citation breadth, and distributional dispersion, as shown by SHAP analyses; replications on robotics, advanced materials, and neuro implants confirm that exogenous diffusion remains the top‑ranked target across fields (R^2_test ~ 0.60‑0.87), while endogenous predictability rises markedly in neuro implants (R^2_test = 0.83), indicating that the quantum‑computing asymmetry does not generalise uniformly. Case studies reveal that sharp entropy increases coincide with the opening of new conceptual frontiers, while entropy collapses signal technological convergence or paradigm displacement. These results demonstrate that conceptual diffusion is governed by stable structural regularities embedded in semantic and citation environments. By identifying early diversity‑based signals of cross‑domain uptake, the approach provides a scalable foundation for anticipatory scientometrics, technology foresight, and innovation‑oriented policy analysis in rapidly evolving research fields.

Authors:Kieran A. Murphy, Shameen Shrestha
Title: Attribution via Distributional Paths for Information Revelation
Abstract:
Feature attribution methods explain predictions by assigning importance scores to input features. Path‑based methods such as Integrated Gradients are especially appealing because they satisfy completeness: attributions sum to the change in model output between a reference state and the input. Yet most path methods define this trajectory in input space, explaining a model through pointwise perturbed inputs along a chosen path. An input‑space path integrates the model's raw response at each point it passes through, with no control over the resolution at which a feature is queried; the early, baseline‑adjacent part of the trajectory contributes to the explanation on equal footing with the input itself. Here, we lift path attribution from input space to a space of structured probe distributions around the example of interest, and call our method Reveal‑IG. Rather than traversing raw input values, Reveal‑IG progressively reveals information about the input and attributes changes in the model's expected output along this distributional path. The result is a path‑attribution framework that retains completeness with respect to the expected model response, and naturally accommodates multiscale image probes and feature‑wise uncertainty in tabular data. Synthetic diagnostics show that Reveal‑IG avoids path artifacts that affect input‑space methods, and across ImageNet classification and tabular regression it produces stable, signed attributions ‑‑ leading on metrics that use attribution sign while remaining competitive on the rest.

Authors:Thomas Maillart, Thibaut Chataing, Ntorina Antoni, David Dosu, Paul Bagourd, Julian Jang-Jaccard, Alain Mermoud
Title: Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics
Abstract:
We introduce an explainable machine‑learning approach that forecasts the structural precursors of scientific breakthroughs ‑‑ the emergence and intensification of links between research concepts ‑‑ by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two‑stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link‑existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparative validation across four technology and biomedical domains yields ROC‑AUC in [0.954, 0.967] at all horizons without re‑tuning, exceeding the roughly 0.90 of prior models, while every forecast rests on structural, auditable features rather than opaque embeddings. Classification performance is high (AUC about 0.95) and regression remains stable (RMSLE 0.45 to 0.6 over one to five years). Feature attribution shows that structural factors ‑‑ particularly Adamic‑Adar similarity and degree‑based Hadamard measures ‑‑ consistently drive accuracy, suggesting that breakthrough‑relevant recombinations emerge in tightly connected sub‑networks. Two expert‑anchored cases, quantum annealing and AI‑enabled quantum architectures, show the model surfacing technological convergence consistent with expert expectations. We then outline a three‑layer decision architecture ‑‑ detection, expert translation, institutional integration ‑‑ that turns these forecasts into evidence‑based research strategy and policy, anchored in open data and explainable features.

Authors:Haowei Han, Yuxiang Wang, Guojia Wan, Hao Wang, Shanshan Feng, Hao Huang, Jiawei Jiang, Xiao Yan
Title: Text-attributed Graph Condensation via Text Selection and Attribute Matching
Abstract:
Text‑Attributed Graph (TAG) is an important type of graph structured data, where each node has a text description. TAG models usually train a Graph Neural Network (GNN) and language model jointly, which leads to high space and time consumption, especially on large datasets. To mitigate this, we propose TAGSAM, a condensation method that compresses TAGs while preserving training accuracy. TAGSAM comes with two key designs, i.e., subgraph text Selection and Attribute similarity Matching, which compress the text description and graph topology of TAG, respectively. For the texts, subgraph text selection selects and merges representative text chunks from multiple related text descriptions by maximizing mutual information. For the graph topology, popular condensation methods based on Matching Training Trajectories (MTT) suffer from high variance, which hinders accuracy. Our attribute similarity matching mitigates this issue by aligning stable similarity matrices. We evaluate TAGSAM against six state‑of‑the‑art baselines, where it showcases superior performance. For the same compressed size, TAGSAM improves upon the best‑performing baseline by an average of 4.9% in accuracy. Furthermore, it maintains competitive training accuracy even when the TAG is condensed to just 1% size. Our code is available at https://github.com/SundayVHan/TAGSAM

Authors:Georgios Tsoumplekas, Stella Bounareli, Vasileios Argyriou
Title: Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting
Abstract:
Low‑Rank Adaptation (LoRA) successfully enables personalization in text‑to‑image generation by adapting pre‑trained diffusion models to specific visual concepts and styles. However, extending such models to multi‑concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, resulting in degraded visual quality and reduced fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach for multi‑concept customization by optimally combining the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens and introduce two methods, W‑Switch and W‑Composite, that employ a prompt‑aware importance weighting strategy in which each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics by proposing a new image‑based similarity evaluation framework that assesses image fidelity and identity preservation through comparisons between real‑world reference images and automatically segmented concept regions from generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state‑of‑the‑art methods in terms of visual quality, identity preservation and compositionality. Qualitative evaluations, including a Large Language Model (LLM) based assessment and a user study, further validate the effectiveness of the proposed methods and align with the newly introduced quantitative image‑based metrics. Our code is available at https://github.com/GeorgeTsoumplekas/Prompt‑Aware‑Multi‑LoRA‑Composition.

Authors:Zhengbao He, Ruiqi Ding, Zhehao Huang, Ruikai Yang, Tao Li, Xiaolin Huang
Title: Compress then Merge: From Multiple LoRAs into One Low-Rank Adapter
Abstract:
Low‑rank adaptation (LoRA) enables parameter‑efficient specialization of foundation models, but the proliferation of task‑specific adapters fragments capabilities across many adapters, complicating reuse and deployment. We study the problem of merging T LoRAs into a single rank‑r LoRA, thereby preserving the benefits of low‑rank structure. Existing Merge‑then‑Compress pipelines treat the rank constraint as an afterthought: they merge adapters in the full parameter space, then compress the merged result to rank r via truncated SVD. However, full‑parameter merging may destroy the low‑rank structure, making it difficult for subsequent compression to recover an effective rank‑r LoRA. We propose Compress‑then‑Merge (CtM), a reversed pipeline that enforces the rank‑r bottleneck before merging: CtM computes shared r‑dimensional subspaces using only the LoRA weights to capture cross‑adapter common structure, projects each adapter into the shared subspaces to obtain r× r coordinates, and then applies standard merging rules in this reduced space. CtM guarantees a rank‑r LoRA by construction, avoiding post‑hoc truncation, and enables efficient computation in the core space spanned by concatenated LoRA factors. Experiments across multiple models and tasks show that CtM consistently outperforms existing single‑LoRA‑output baselines while narrowing the performance gap to full‑parameter merging methods.

Authors:Salih Bora Ozturk, Alexander Pfefferle, Frank Hutter
Title: Speedrunning Tabular Foundation Model Pretraining
Abstract:
Pretraining cost is a major bottleneck for research on tabular foundation models, slowing the iteration cycle for new architectures, priors, and optimization ideas. Yet the community lacks a simple way to compare and accumulate pretraining speedups. We introduce a community speedrun for nanoTabPFN: contributors modify a single‑file training script and compete to reach a fixed downstream ROC AUC target on subsampled TabArena using one NVIDIA L40S GPU. The current best record reaches the target in 0.92 minutes, an 81x speedup over the 74.32 minute baseline while using 22x fewer synthetic datasets. The speedrun format provides a simple protocol for the community to add, verify, and stack pretraining improvements, with the leaderboard open to contributions. Code and records are available at https://github.com/borawhocodess/modded‑nanotabpfn.

Authors:Liuyuan Wen, Xun Zhu, Lihao Huang, Wenbin Li, Yang Gao
Title: The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models
Abstract:
Large Language Models exhibit paradoxical fragility in fundamental arithmetic, implying a disconnect between internal computation and discrete output. By analyzing the residual stream geometry during multi‑operand addition, we identify the Iso‑Raw‑Sum Trajectory (IRST), a geometric structure where representations are anchored by semantic digits and modulated by continuous carry fibers. We propose the Noisy Quantization Model to explain this geometry, framing arithmetic errors as Geometric Slippages caused by internal neural noise pushing a continuous, latent Carry Potential across quantization thresholds. This geometric framework further elucidates Probe Versatility, explaining how lightweight probes can disentangle coexisting latent signals (such as ground truth versus hallucination) from a single activation vector. Finally, we validate these insights through a geometric consistency check method that effectively detects and corrects these quantization failures during inference. Our code is available at https://github.com/RL‑MIND/Shape‑of‑Addition.

Authors:Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal
Title: Building Reliable Long-Form Generation via Hallucination Rejection Sampling
Abstract:
Large language models (LLMs) have achieved remarkable progress in open‑ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long‑form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference‑time hallucination mitigation framework, named Segment‑wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long‑form text. Our method enables models to self‑correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long‑form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination‑rejection‑sampling.

Authors:Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng
Title: Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification
Abstract:
Test‑time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label‑free manner. Despite existing studies focusing on Pass@1 performance, optimizing Pass@k remains under‑explored yet critical in label‑free settings, which measures generation coverage for sustained exploration. Optimizing Pass@k in label‑free setting is highly non‑trivial, as directly applying the Pass@k advantage designs effective for RLVR yields unsatisfactory performance. Through in‑depth empirical analysis, we discover the root causes hindering performance: pseudo‑label estimations for low‑confidence samples have a high probability of being incorrect, while candidate answers for high‑confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL‑CoCoV (Test‑Time Reinforcement Learning with Confidence‑Conditioned Verification), a novel confidence‑adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL‑CoCoV employs a confidence‑conditioned mechanism: for high‑confidence samples, it bootstraps verifier and applies an exploration‑enhancing reward to prevent diversity collapse; for low‑confidence samples, it delegates pseudo‑label selection to the verifier to filter incorrect pseudo‑labels; and for medium‑confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL‑CoCoV outperforms the best competing methods across 6 widely‑recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.

Authors:Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu
Title: CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery
Abstract:
Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM‑augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data‑centric algorithm can make results sensitive to algorithm‑specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble utilizes a consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near‑perfect accuracy on the filtered consensus edges. Second, a trust‑calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation‑free trust calibration procedure, which is then utilized to govern a trust‑weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee the final causal graph is validly acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data‑centric and LLM‑augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.

Authors:Timo Osterburg, Stefan Schütte, Torsten Bertram
Title: Learned Non-Maximum Suppression for 3D Object Detection
Abstract:
Post‑processing is a critical stage in LiDAR‑based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non‑maximum suppression (NMS) by leveraging relations among detections. D2D‑Rescore employs transformer‑based detection‑to‑detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's‑eye view. A metric‑aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection‑level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst‑tu‑dortmund/learned‑3d‑nms .

Authors:Vadim Porvatov, Andrey Dukhovny, Andrey Lange
Title: How Many Trees in a Random Forest? A Revisited Approach with Plateau Search and Optuna Integration
Abstract:
Hyperparameter optimization (HPO) for Random Forest faces a specific difficulty in tuning the number of trees: the predictive score typically improves monotonically with ensemble size, so standard methods such as Tree‑structured Parzen Estimator (TPE) and Hyperband require a predefined search range and often drive the estimate toward its right boundary. Early‑stopping strategies avoid fixing such a range, but can be sensitive to score noise and prone to premature stopping. To address this, we propose an integrated triplet‑based plateau‑search algorithm that removes the number of trees from the direct TPE search space and still exploits information accumulated across HPO trials. The method adaptively tracks a near‑minimal sufficient ensemble size by monitoring relative changes in the out‑of‑bag (OOB) score across a triplet of forest sizes and shifting this triplet accordingly. This yields an automated and user‑interpretable procedure based on a tolerance parameter. We also provide a theoretical analysis: we relate the proposed relative OOB‑score criterion to the gap between the current and limiting scores, and derive an asymptotic variance estimate for the corresponding OOB‑based absolute relative difference. Experiments show that the selected number of trees can differ substantially from the common heuristic: for most classical benchmark datasets it is smaller, whereas for some high‑dimensional bioinformatics datasets, such as Arcene and Dorothea, it is larger. The source code and reproducible experiments are available at https://github.com/lange‑am/rf_plateau_hpo.

Authors:Artur Zagitov, Alexander Miasnikov, Maxim Krutikov, Vladimir Aletov, Gleb Molodtsov, Nail Bashirov, Artem Tsedenov, Aleksandr Beznosikov
Title: Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression
Abstract:
Post‑training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing studies evaluate these methods in narrow settings, leaving unclear whether tensorization is effective at large‑scale deployment. We systematically evaluate tensor compression across dense and MoE architectures, establishing performance trade‑offs grounded in both empirical analysis and theoretical analysis. We identify a fundamental mismatch between the shared subspaces assumed by tensor decompositions and the heterogeneous representations learned by modern LLMs, thereby delineating their practical limits and clarifying their viable role in large‑scale deployment. The code is available at https://github.com/brain‑lab‑research/TT‑LLM.

Authors:Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli
Title: KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks
Abstract:
Test‑time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory‑bottlenecked during long‑horizon decoding, as the KV‑cache grows. KV‑cache quantization can help improve this, but current methods are evaluated under prefill‑like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration‑free KV‑cache quantizer that applies a Hadamard rotation followed by a dual‑scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token‑scale errors and substantially reduces error accumulation over existing baselines. KVarN establishes a new state‑of‑theart for KV‑cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2‑bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei‑csl/KVarN

Authors:KeXiang Mao, FanCheng Li
Title: Flicker-DDPM: Accelerating Denoising Diffusion via 1/f Colored Noise Injection
Abstract:
We propose a novel diffusion model, Flicker‑DDPM, which incorporates flicker (1/f) noise inspired by self‑organized criticality (SOC), a widely observed phenomenon in natural systems. Unlike denoising diffusion probabilistic models (DDPMs), which employ isotropic white noise in the forward process, Flicker‑DDPM adopts colored noise with power‑law spectra to better match the spectral statistics of natural images, whose power spectra typically follow P(k) proportional to 1/k^α. To this end, we develop a colored‑noise module based on a spatial correlation kernel, σ(d) = (d + 1)^‑η, and theoretically establish that adjusting η controls the spectral exponent α of the generated 1/fα noise, enabling adaptation to datasets with diverse spectral characteristics. On CIFAR‑10, Flicker DDPM matches or surpasses the generation quality of a standard DDPM baseline using 3.33 times fewer sampling steps, with negligible additional computational cost per step. We further develop a frequency‑domain linear theory demonstrating that spectrally matched colored noise linearizes the reverse trajectory, theoretically explaining the observed sampling acceleration.

Authors:Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang
Title: When Model Merging Breaks Routing: Training-Free Calibration for MoE
Abstract:
Model merging has emerged as a cost‑effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture‑of‑Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non‑linear softmax and discrete Top‑k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load‑balancing constraints imposed during MoE pretraining. Because fine‑tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian‑Aware Router Calibration (HARC), a training‑free framework that leverages second‑order curvature information to realign the merged router. This approach admits a closed‑form solution that can be efficiently solved using a matrix‑free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.

Authors:Can Lv, Mingju Chen, Heng Chang, Shiji Zhou
Title: Mitigating False Credit Propagation: Probabilistic Graphical Reward Aggregation for Rubric-Based Reinforcement Learning
Abstract:
Rubric‑based rewards are increasingly used for open‑ended language model post‑training, but criterion‑level scores are often aggregated as independent utilities. This flat scalarization ignores rubric‑specified prerequisite and activation relations among criteria, allowing reward or penalty to be counted even when the condition that licenses it is absent. We call this structural reward‑aggregation failure False Credit Propagation (FCP). To address this limitation, we propose \ourname (Graphical Event Aggregation for Rubric rewards), a probabilistic graphical framework for dependency‑aware rubric aggregation. \ourname models each criterion outcome as a latent Bernoulli event in a typed rubric graph, propagates soft suppression from unsupported parent events to their children, and aggregates the resulting event probabilities into a normalized expected signed utility. This yields a linear‑time reward computation that can be plugged into standard rubric‑based RL pipelines without changing the outer optimization algorithm. Experiments on HealthBench, WritingBench, and PLawBench with two policy backbones show that \ourname consistently improves over flat aggregation and deterministic gating, achieving relative gains of up to 15.5% over flat aggregation. FCP diagnostics further show that \ourname reduces leakage by 96.5% relative to flat aggregation while preserving more licensed downstream utility than deterministic gating. Our code is publicly available at https://github.com/LvCan926/GEAR.

Authors:Daniil Krasnoproshin, Maxim Vashkevich
Title: Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection
Abstract:
Speech emotion recognition is an important component of modern human‑computer interaction systems. However, many state‑of‑the‑art approaches rely on large pretrained models with high computational and memory requirements, limiting their applicability. This paper proposes ResLSTM‑SA, a lightweight architecture that integrates residual connections with soft attention within an LSTM‑based framework. Evaluated on the RAVDESS dataset under strict speaker‑independent partitioning, the proposed model outperforms conventional attention‑based LSTM baselines and several previously reported CNN‑ and hybrid CNN‑LSTM architectures in terms of unweighted average recall (UAR). The best‑performing variant (ResLSTM‑SA‑h64) achieves a maximum UAR of 0.6517 with only 46.8k trainable parameters, delivering competitive accuracy with three orders of magnitude fewer parameters than large‑scale self‑supervised alternatives, thereby enabling efficient deployment on edge devices and real‑time voice assistants. The source code is available at https://github.com/Mak‑Sim/ResLSTM‑SER.

Authors:Gurvan Richardeau, Gohar Dashyan, Erwan Le Merrer, Gilles Tredan
Title: FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences
Abstract:
Literature reveals that a Large Language Model's (LLM) behavior is not only conditioned by its original weights but also its instance‑level parameters, such as instructional prompt, sampling configuration or quantization. A model that generates safe outputs under one configuration may produce toxic content under another. However, current LLM identification techniques (such as fingerprinting) focus on intellectual property protection, and their design favors robustness to changes in these instance‑level parameters. This poses a critical challenge for AI regulation in which compliance assessments target actual deployed behaviors, not model provenance. In this paper, we introduce instance‑level fingerprinting, a regulator‑oriented paradigm that distinguishes configurations of the same LLM. Our method FLIPS, exploits biases in generated binary random sequences to reach 96% (closed‑set) and 90% (open‑set, where some targets are unknown) identification accuracy across 237 model instances, versus 35% for the adapted LLMmap baseline. This shows that instance‑level fingerprinting is both necessary for regulation and practically feasible. Code available at https://github.com/GurvanR/FLIPS‑LLM‑Instance‑Fingerprinting.

Authors:Sungwon Kim, Juho Song, Seungmin Shin, Guimok Cho, Sangkook Kim, Chanyoung Park
Title: EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs
Abstract:
Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because they depend heavily on specific coordinate systems. While equivariant networks offer a solution, they typically rely on local operations in the spatial domain, making the global receptive field, which is essential for PDE dynamics, computationally expensive. Conversely, Fourier Neural Operators (FNOs) efficiently capture global interactions, yet establishing 3D equivariance within them remains impractical due to the prohibitive cost of spectral group convolutions. To bridge this gap, we introduce EqGINO, a geometrically robust framework that enforces isotropy in the spectral domain. By design, EqGINO guarantees exact equivariance to the discrete symmetries inherent to the discretized computational domain. Beyond this discrete guarantee, our structural prior enables effective generalization to arbitrary continuous orientations even with a limited number of SE(3)‑transformed training samples. Consequently, our method robustly models coordinate‑invariant physical laws on complex irregular 3D geometries. Our code is available at https://github.com/sung‑won‑kim/EqGINO

Authors:Alston Lo, Luka Mucko, Austin H. Cheng, Andy Cai, Alastair J. A. Price, Wojciech Matusik, Alán Aspuru-Guzik
Title: Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching
Abstract:
Organic crystal structure prediction (CSP) is a requirement for computational modelling of organic solids, but traditionally costs several CPU‑years per molecule. Generative models such as OXtal dramatically reduce this cost by sampling stable organic crystal structures directly. However, OXtal forgoes explicit lattice parametrization in favour of modelling large crops of the bulk material with expensive triangle layers, which can incur a computational cost of minutes per molecule. In this paper, we reduce this to seconds with Clari, a large‑scale flow matching model that generates redundancy‑free unit cells and replaces triangle layers with pure pair‑bias attention. Clari requires only atom types and bonds as input and does not need an RDKit‑sanitizable input molecule, which expands its applicability to challenging chemistries such as fullerenes, metal complexes, and atom clusters. We further ablate key design choices such as auxiliary losses, timestep distributions, noise priors, and self‑conditioning. On OXtal's test sets, we surpass OXtal's solve rate while obtaining a speedup of 15‑30×. Because Clari also models explicit hydrogens, it supports inference‑time scaling via direct energy ranking, without any decoration or relaxation step. When generating 150 crystals and selecting the top‑30 by energy, we further improve solve rate while maintaining a speedup of 5‑8×. We also introduce the CSD Teaching Subset as a new test split of diverse and complex molecules for future benchmarking. Our contributions enable CSP within seconds, making large‑scale virtual screening of organic solids practical. Code is available at https://github.com/aspuru‑guzik‑group/clari.

Authors:Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Ali Tajer
Title: Multi-component Causal Tracing in Large Language Models
Abstract:
Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single‑component or single‑layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi‑layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi‑component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi‑component‑causal‑tracing.

Authors:Aqsa Naseer, Maryam Bibi, Syeda Samiya Urooj, Muhammad Khurram Shahzad
Title: ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements
Abstract:
Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used across multiple domains. The Whitening Transform‑based Probabilistic Shape Regularization Extractor (WT‑PSE), published in IEEE Transactions on Medical Imaging in 2024, addresses this challenge by employing feature decorrelation and Wasserstein distance‑based knowledge distillation to achieve robust cross‑domain segmentation. This study systematically examines improvements to the WT‑PSE learning framework. Four limitations in the original implementation are identified: limited training augmentations that fail to simulate real scanner variations, reliance on per‑pixel binary cross‑entropy loss that is sensitive to edge noise, the absence of a scheduled loss weighting strategy that may destabilize early training, and the lack of ablation switches for controlled scientific comparison. To address these issues, we propose four enhancements: (1) domain‑adaptive augmentation including random erasing, gamma correction, and salt‑and‑pepper noise; (2) a hybrid BCE and Dice loss function for improved edge‑aware segmentation under noisy conditions; (3) a curriculum‑based Dice weight scheduling strategy; and (4) command‑line control flags for systematic ablation studies. Experiments on the fundus optic disc segmentation benchmark demonstrate that the improved pipeline achieves a final epoch optic‑disc Dice score of 0.956 and an ASD score of 13.31, outperforming the baseline epoch‑5 Dice score of 0.939. These results indicate that training‑level improvements can provide consistent performance gains without modifying the underlying WT‑PSE architecture.

Authors:Phillip Jiang
Title: RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases
Abstract:
Relational databases underpin modern enterprise, scientific, and healthcare systems, yet predictive machine learning on such data remains challenging due to their multi‑table, heterogeneous, and temporal structure. Relational Deep Learning (RDL) addresses this by representing databases as heterogeneous graphs and applying graph neural networks (GNNs) directly. RelBench v2 recently introduced autocomplete tasks ‑‑ a practically motivated task type where the goal is to predict an existing column value from relational context, analogous to an intelligent form‑filling assistant. We propose RelGT‑AC (Relational Graph Transformer for Autocomplete), extending the RelGT architecture with three targeted contributions: (1) a column masking strategy that prevents trivial solutions by masking the target column during subgraph encoding; (2) a unified task head supporting binary classification, multiclass classification, and regression autocomplete tasks within a single model; and (3) a TF‑IDF text encoder that automatically detects and encodes free‑text columns, recovering strong lexical signal that categorical encoders discard. Across 7 tasks spanning 3 RelBench v2 datasets (rel‑trial, rel‑f1, rel‑stack), RelGT‑AC outperforms the GraphSAGE baseline on all 3 regression autocomplete tasks and achieves up to +10 AUROC points on text‑heavy eligibility tasks via the TF‑IDF encoder.

Authors:Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li
Title: Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference
Abstract:
Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast‑dLLM addressed this with KV caching and confidence‑guided parallel decoding, but its decoding theory uses a homogeneous high‑confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose Fast‑dLLM++, a training‑free extension that introduces \emphFréchet profile decoding: selecting parallel commit sets from the full sorted confidence profile rather than a single worst‑case confidence. The resulting rule is a heterogeneous‑confidence generalization of Fast‑dLLM's factor selector and it recovers the previous rule exactly in the equal‑confidence case and adds a provable \emphheterogeneity bonus when the selected tokens have uneven confidences. Fast‑dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop‑in replacement for existing Fast‑dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA‑8B model show that the theoretical improvement translates directly into empirical gains: profile‑aware selection improves the accuracy‑‑throughput frontier by exploiting safe parallelism that weakest‑token rules miss, achieving up to 37% higher throughput at comparable accuracy. Our anonymous code release is at https://github.com/Ringo‑Star/FastdLLM_plusplus.

Authors:Yiran Qiao, Jing Chen, Jiaqi Xu, Yang Liu, Qiwei Zhong, Xiang Ao
Title: Outsmarting the Chameleon: Counterfactual Decoupling for Tactical OOD Shifts in Live Streaming Risk Assessment
Abstract:
Live streaming has emerged as a primary medium for social interaction and digital commerce, yet it is increasingly plagued by sophisticated risks. A fundamental challenge in this domain is \emphtactical out‑of‑distribution (OOD) shift: while malicious actors maintain stable underlying objectives, they continuously redesign narrative packaging to evade detection. Such adversarial shifts expose critical limitations of existing OOD generalization paradigms, whose assumptions are difficult to satisfy in the presence of tightly coupled intent‑tactic evolution and ill‑defined raw‑level counterfactuals. In this paper, we tackle this issue from a \emphlatent causal perspective and propose \underlineLatent‑\underlinePredictive \underlineCounterfactual \underlineDecoupling~(LPCD), a plug‑in framework for robust live streaming risk assessment. LPCD enables counterfactual reasoning under adversarial tactical re‑packaging by modeling intent and narrative variation at the latent level, and enforces \emphlatent counterfactual consistency to anchor risk prediction on causally stable malicious intent. At inference time, LPCD applies a lightweight, parameter‑free calibration to further mitigate tactic‑induced distribution shifts. Extensive experiments on large‑scale industrial datasets and online production traffic demonstrate that LPCD consistently outperforms state‑of‑the‑art baselines, validating its effectiveness in moderating evolving adversarial risks in real‑world live streaming. The project page is available at https://qiaoyran.github.io/LiveStreamingRiskAssessment/.

Authors:Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski
Title: Cosmos 3: Omnimodal World Models for Physical AI
Abstract:
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture‑of‑transformers architecture. By supporting highly flexible input‑output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI ‑‑ effectively subsuming vision‑language models, video generators, world simulators, and world‑action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state‑of‑the‑art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general‑purpose backbones for embodied agents. Our post‑trained Cosmos 3 models were ranked as the best open‑source Text‑to‑Image and Image‑to‑Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW‑1.1 https://openmdw.ai/license/1‑1/ License at https://github.com/nvidia/cosmosgithub.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3 . The project website is available at https://research.nvidia.com/labs/cosmos‑lab/cosmos3 .

Authors:Peixuan Han, Hongyi Du, Jiayu Liu, Yihang Sun, Yutong Liu, Jiaxuan You
Title: $Ψ$-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues
Abstract:
Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, limiting their ability to interact with users and provide suggestions or guidance proactively. To systematically evaluate such proactive personalization in realistic interactions, we propose Ψ‑Bench, a benchmark for assessing LLMs' ability to influence realistic users through conversation. We design three real‑world interaction scenarios that involve persuasion in Ψ‑Bench, and endow simulated clients with personal characteristics through explicit user profiles derived from dialogue histories. We evaluate 10 frontier LLMs on Ψ‑Bench and find that while most models can produce coherent and reasonable arguments, even state‑of‑the‑art models still leave considerable room for improvement in persuasion. We also find that providing access to client profiles yields an average performance gain of 18.24%, highlighting the importance of user‑specific information for effective persuasion. Overall, our work highlights persona‑sensitive influencing as a challenging yet practical direction for evaluating and developing more proactive personalized LLM agents. Codes are available at: https://github.com/Hanpx20/Psi‑Bench.

Authors:Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng
Title: Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
Abstract:
On‑Policy distillation (OPD) in large language models is shifting from full‑trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink optimization granularity of OPD and propose \fireicon\ FiRe‑OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In details, FiRe‑OPD first filters trajectories to remove low‑quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe‑OPD leverages a soft‑weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer‑grained OPD optimization. We validate the effectiveness of FiRe‑OPD across strong‑to‑weak, single‑teacher, and multi‑teacher settings, and demonstrate its superiority over recent token‑level OPD methods ( (e.g., +6.25 on AIME 2024 in strong‑to‑weak, +18.81 on Miner in multi‑teacher). Our code is available at https://github.com/YuYingLi0/FiRe‑OPD.

Authors:Yunlong Zhou, Chen Zhao, Danyang Peng, Fanfan Ji, Xiao-Tong Yuan
Title: Learning to Refine: Spectral-Decoupled Iterative Refinement Framework for Precipitation Nowcasting
Abstract:
Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods face a key trade‑off: regression models produce over‑smoothed, spectrally decaying predictions that blur convective details and violate turbulence power laws; diffusion models generate realistic yet unanchored hallucinations lacking physical grounding. We propose Spectral‑Decoupled Iterative Refinement (SDIR), a deterministic framework that reformulates nowcasting as progressive frequency‑decoupled refinement. SDIR first extracts a stable low‑frequency synoptic skeleton, then iteratively refines high‑frequency textures under physical constraints, eliminating both blurring and hallucinations. It features a dual‑path design: the Synoptic Frequency‑Guided Former (SFG‑Former) with Scale‑Adaptive Transformers for global structure, and the Fourier Residual Refiner (FR‑Refiner) with Scale‑Conditioned Fourier Neural Operators for fine residuals. A Physically Consistent Power Spectral Density (PCPSD) loss with dynamic masking enforces a turbulence‑consistent spectral distribution. Experiments on three benchmarks show SDIR significantly outperforms SOTA methods in spatial accuracy while achieving spectral fidelity competitive with diffusion‑based methods, enabling reliable high‑resolution operational nowcasting. Code link: https://github.com/RuntimeWarning/SDIR.

Authors:Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh
Title: SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models
Abstract:
Despite the success of audio‑visual large‑language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio‑visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech‑vision hallucination in audio‑visual LLMs. Our benchmark diagnoses speech‑vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state‑of‑the‑art open‑source audio‑visual LLMs struggle with aligning speech content with corresponding visual signals, with a near‑random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open‑source models. Our analysis suggests that their failures stem from limited ability in cross‑modality understanding, despite strong performance in single‑modality perception. Our work uncovers a new and fundamental limitation of current audio‑visual LLMs and highlights the need for speech‑grounded video comprehension. Project page: https://chenshuang‑zhang.github.io/projects/svhalluc/.

Authors:Zaifei Yang, Samuel Ping-Man Choi, James Kwok
Title: Enhancing Protein-Protein Interaction Prediction with Hierarchical Motif-based Multimodal Protein Embedding
Abstract:
Protein‑protein interactions (PPIs) are essential for many biological processes. However, existing PPI prediction approaches suffer from two major limitations: they overlook the hierarchical organization of proteins, particularly meso‑scale motifs that critically regulate PPIs, and fail to effectively integrate sequence, structure, and function modalities. To address these limitations, we propose MMM‑PPI, a Hierarchical Motif‑based Multi‑Modal protein Encoder for PPI Prediction that constructs PPI embeddings in a bottom‑up multi‑modal manner across three scales. At the micro‑scale, we encode three modal residue features; at the meso‑scale, a novel multimodal motif encoder aggregates residues into spatially‑informed motif embeddings; at the macro‑scale, a multimodal protein encoder integrates motifs into protein embeddings by jointly modeling motif importance and inter‑modal correlations. The pre‑trained encoder can be used off‑the‑shelf for large‑scale PPI prediction. Extensive experiments on multiple PPI datasets show that MMM‑PPI outperforms state‑of‑the‑art multi‑label PPI prediction models, particularly under challenging data partitions and limited data scenarios. Codes are in https://github.com/yzf‑code/MMM‑PPI.

Authors:Jin Gao, Juntu Zhao, Zirui Zeng, Jiaqi Shen, Junhao Shi, Dukun Zhao, Yuming Lu, Dequan Wang
Title: TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering
Abstract:
AI for scientific discovery is entering an agentic era, where protein‑engineering systems are expected to prioritize future wet‑lab experiments rather than merely fit static measurements. We introduce TadA‑Bench, a million‑variant wet‑lab replay benchmark from 31 TadA directed‑evolution rounds for future‑round discovery toward agentic protein engineering. TadA‑Bench preserves the campaign chronology and defines a fixed‑data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph‑based label‑unification pipeline, to reconcile noisy enrichment measurements into consistent cross‑round activity labels. Random‑split controls show strong interpolation, but future‑round ranking and finite‑budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA‑Bench as a reproducible wet‑lab replay substrate for future‑round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.

Authors:Nikola Cenikj, Özgün Turgut, Alexander Müller, Alexander Steger, Jan Kehrer, Marcus Brugger, Daniel Rueckert, Philip Müller
Title: Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification
Abstract:
Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X‑ray) angiograms remain the standard for stenosis diagnosis, they are invasive, time‑ and resource‑intensive, and therefore only performed on patients with a high probability of disease based on symptoms and prior clinical tests. However, a subset of patients, especially those without symptoms, may remain undiagnosed. Detecting indications of stenosis from ECGs, which are fast, cheap, non‑invasive, and thus routinely acquired even in asymptomatic patients, would support early diagnosis. However, as no reliable stenosis‑specific signal has been identified in ECGs, they can not currently be used for stenosis risk stratification. To address this, we introduce StenCE, a pretraining framework, allowing stratification of patients based on features derived directly from ECGs. Evaluations across varying stenosis severity thresholds and additional ECG disease classification tasks demonstrate consistent performance improvements across different ECG encoders, outperforming previous work. The obtained models successfully detect signals for stenosis diagnosis in ECGs and are the first to achieve high performance in severe stenosis classification. The source code is available at https://github.com/NikolaCenic/ecg‑stenosis‑cls.

Authors:Anherutowa Calvo
Title: Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent
Abstract:
The curvature exponent α in h_k \propto σ_k^α ‑‑ governing how Hessian eigenvalues scale with gradient singular values ‑‑ varies systematically across layer types (α\approx 2 for convolutions, \approx 1 for transformer attention, < 1 for MLP up‑projections). Why? We prove the Spectral Alignment Decomposition: α= 2 + d\logΦ_k / d\logσ_k, where Φ_k measures alignment between Kronecker factor eigenbases and gradient singular directions. This reduces "why does α vary?" to a geometric question we answer for LayerNorm, residual connections, and softmax heads. The decomposition implies a spectral transfer identity s = αγ linking curvature exponent, effective gradient rank‑decay γ, and Hessian decay exponent s. The identity is algebraic; its empirical content is that α and γ, fit on independent data (HVPs vs. SVD), recover s to ~2% median error across 93 layers, five architectures, and three datasets ‑‑ with no free parameters. A zeta‑function bound on participation ratio shows curvature concentrates onto effectively one direction per layer. As a proof of concept, we derive the architecture‑adaptive preconditioner T(σ;α) and show that Spectral Newton ‑‑ implementing T in the gradient singular basis ‑‑ outperforms AdamW on vision benchmarks where α\approx 2.

Authors:Peihan Liu, Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, Lillian Tsai, Alex Bie
Title: ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?
Abstract:
Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly solvable without training, so strong benchmark performance does not establish that DP synthesis can substitute original data access. Thus, we introduce ContinuousBench, a continuously and automatically‑regenerated benchmark that measures capability gain from DP synthetic text. Each quarter, a new release pairs a never‑before‑seen training corpus with a derived QA set, constructed to be: (1) unsolvable sans‑corpus; and (2) learnable under DP, as the tested knowledge is supported by hundreds of independent records. Researchers produce DP synthetic data from the training corpus and run our standardized training and evaluation harness on their synthetic data to measure gains. We instantiate two tracks: Geminon, a procedurally‑generated dataset about fictional creatures; and News, a stream of newly crawled public news articles. Although standard benchmarks are nearly saturated, on ContinuousBench we find that non‑private synthesis transfers substantial knowledge from the original corpus, while state‑of‑the‑art DP synthesis methods generally fail to do so, even at \varepsilon=100.

Authors:Zewen Liu, Zhan Shi, Yisi Sang, Bing He, Minhua Lin, Tianxin Wei, Dakuo Wang, Benoit Dumoulin, Wei Jin, Hanqing Lu
Title: Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams
Abstract:
Auto‑harness systems such as A‑Evolve, GEPA, and Meta‑Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open‑ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These challenges make a single repeatedly and densely updated harness brittle, causing performance degradation as accuracy peaks early and then declines. This motivates sustained harness construction with task‑wise adaptation. We introduce Adaptive Auto‑Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into evolution loss and adaptation loss. The system addresses these losses with a stateful multi‑agent evolver, a harness tree with solve‑time routing, and human‑steering hooks for cases where history lacks the needed signal. Across prediction‑market, security‑competition, and event‑forecasting streams, Adaptive Auto‑Harness outperforms five existing auto‑harness baselines and ablations attribute gains to better construction, routing, or targeted human steering. Code is available in \hrefhttps://github.com/A‑EVO‑Lab/a‑evolve/tree/release/adaptive‑auto‑harnessLink.

Authors:Chad A. Capps
Title: CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability
Abstract:
We present CART (Context‑Anchored Recurrent Transformer), a parameter‑efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key‑value tensors at every iteration, CART computes K and V once from a multi‑layer prelude and has the recurrent core cross‑attend to those frozen tensors via multi‑head latent attention. A learned Linear Time‑Invariant (LTI) gate keeps the recurrence stable: its spectral radius settles in a narrow band (rho in [0.79, 0.83]) across all 36 fully‑trained configurations. We evaluate CART on single consumer GPUs in two stages: a 64‑configuration screen at 3,000 steps, then 36 configurations (P=6, R in 6,8,10, three seeds) trained for 30,500 steps (~1B tokens). Two patterns hold across widths d in 256,512,768,1024: prelude depth P dominates loop count R, and the Stage‑1 ranking of R reverses at full training (R=6 becomes best at d>=512). At the binding d=1024 parameter‑parity test, CART does not beat a parameter‑matched dense baseline, losing by 1‑2% at stored‑parameter parity and by ~10% at effective‑parameter parity. Diagnostic ablations split the effective‑parameter gap into ~5% from weight sharing and a residual ~5% from the heterogeneous prelude/anchor/core/coda framing; the recurrent‑core machinery (hyper‑connections, LTI gate, loop‑index embedding) is individually vestigial. Variable‑R inference degrades on both sides of the trained R, a negative result for test‑time depth scaling under this recipe.

Authors:Shailesh Rana
Title: The Shape of Wisdom: Decision Trajectories in Language Models
Abstract:
Language models do not simply choose an answer at the output layer. In a 9,000‑trajectory MMLU study across Qwen2.5‑7B‑Instruct, Llama‑3.1‑8B‑Instruct, and Mistral‑7B‑Instruct‑v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next‑layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable‑correct, not stable‑correct. A traced subset then asks what moves the margin. In stable‑correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer‑supporting text hurts the margin and removing distractor‑like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.

Authors:Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Mykola Pechenizkiy, Elena Mocanu, Torsten Hoefler, Decebal Constantin Mocanu
Title: When Data Is Scarce: Scaling Sparse Language Models with Repeated Training
Abstract:
Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in data‑constrained regimes where limited unique tokens require multi‑epoch training. Our experiments span models up to 1.92B parameters in the fitting set, sparsity up to 93.75%, unique data budgets up to 2.6B tokens, and total training tokens up to 41.6B over 16 epochs; we further validate extrapolation on held‑out dense‑equivalent models up to 7.68B parameters. We find that: 1. Sparse scaling in data‑limited settings: We introduce a scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity, accurately predicting performance across compute and data budgets. 2. Delayed data saturation: sparse training postpones diminishing returns from repeated data, making multi‑epoch training more effective. 3. Resource trade‑offs: With fixed data, loss‑optimal sparsity is moderate ~ 50%, while compute‑optimal sparsity is higher and grows with data scale. Overall, sparsity is not just a tool for efficiency, but a mechanism for improving scaling trade‑offs under data scarcity. Our code is available at: https://github.com/boqian333/sparse‑dc‑scaling.

Authors:Jun-Jie Yang, Chia-Heng Hsu, Kui-Yuan Chen, Ping-Chun Hsieh
Title: From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning
Abstract:
Preference‑based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two‑stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward‑free representation learning (RFRL) from the zero‑shot RL literature, and propose a new training framework that first learns latent successor‑measure representations from reward‑free offline data, followed by contrastive search and fine‑tuning using preference data. Through extensive experiments and ablations, we show that our method achieves superior preference efficiency over offline PbRL baselines. This work is the first to connect RFRL with PbRL, highlighting its potential as a feedback‑efficient solution. Our code is publicly available at https://github.com/rl‑bandits‑lab/FB‑PbRL.

Authors:Ahmed M. Adly
Title: Measuring the Symmetry--Data Exchange Rate
Abstract:
Equivariance theory predicts that an architectural symmetry prior reduces sample complexity by a factor of |G|; this is widely cited but rarely measured as a scaling law with controls that separate the prior from its confounds. On a controlled C_n‑symmetric task, we report three findings. First, a wrong‑group control with identical orbit size and matched compute is worse than no constraint (joint pairwise CI [+0.79, +3.26] excludes zero, robust across estimators); misaligned constraint is actively harmful, not merely unhelpful. Second, an augmentation baseline equipped with test‑time orbit averaging matches the equivariant model exactly ‑‑ bit‑identical per‑epoch validation curves across matched cells ‑‑ so the architecture‑vs‑augmentation gap is conditional on asymmetric test‑time computation, not unconditional. Third, the relative exchange rate beta_diff = 1.28 is consistent in sign and order of magnitude with the theoretical 1.0 (single‑level CI [+0.92, +2.05]); the more conservative two‑level bootstrap (seeds x group sizes) widens this to [‑0.63, +1.72], including zero, and a finer‑N replication on a sqrt(2)‑spaced grid is inconclusive (point estimate ‑0.82). The methodological contributions ‑‑ the relative‑rate estimator that cancels the shared‑difficulty confound, the wrong‑group control, and a pre‑specified failure taxonomy ‑‑ transfer to any inductive bias whose strength can be parameterised. Honest scoping: the primary estimator beta_diff was adopted post‑hoc after the initial analysis revealed a positive‑slope identifiability problem; the design was never externally pre‑registered; and the headline number rests on an OLS slope over seven group sizes on a coarse N grid. This is an exploratory study, not a confirmatory measurement; the wrong‑group result is the cleanest finding and the one we report with the most confidence. A registered replication on fresh seeds is future work.

Authors:Wyame Benslimane, Tinghan Ye, Pascal Van Hentenryck, Paul Grigas
Title: Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback
Abstract:
Decision‑focused learning (DFL) trains predictive models by optimizing downstream decision quality rather than standalone prediction accuracy. For contextual linear optimization, most existing DFL methods assume offline data and full observations of the objective cost vector. We develop an on‑policy learning method for sequential contextual linear optimization under partial feedback, generalizing the standard bandit feedback setting. Our method learns a stochastic predict‑then‑optimize policy that samples a cost‑vector prediction from a conditional distribution and solves the resulting downstream linear optimization problem. To update this distributional model, we introduce a two‑component hybrid gradient estimator. The first component is a score function estimator, which provides an unbiased but potentially high‑variance policy gradient estimate. The second is a decision‑focused plug‑in component that uses an auxiliary nuisance estimate of the latent cost vector to exploit the downstream optimization structure, becoming more informative as the estimate improves. We prove an \mathcalO(T^‑1/2) bound on the average squared policy‑gradient norm, matching the standard non‑convex SGD rate. Experiments on top‑k selection, shortest path, combinatorial pricing, and a real‑data energy‑scheduling benchmark show that the hybrid gradient approach achieves lower cumulative regret than contextual‑bandit‑style baselines across all benchmarks, using both Gaussian and richer conditional generative models. Code is available at https://github.com/Joeyetinghan/on‑policy‑bandit‑dfl.

Authors:Feifan Jiang, Yinan Bu, Shihao Wu, Gongjun Xu, Ji Zhu
Title: Efficient Synthetic Network Generation via Latent Embedding Reconstruction
Abstract:
Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing black‑box approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synthetic Network Generation via Latent Embedding Reconstruction (SyNGLER), a general and efficient framework for synthetic network generation that builds on latent space network models. Given an observed network, SyNGLER first learns low‑dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution‑free generator over these embeddings. For generation, SyNGLER first samples (or resamples) node embeddings from the generator in the latent space and then produces synthetic networks using the latent space network model. Through the latent space framework, SyNGLER preserves unique characteristics in networks such as sparsity and node degree heterogeneity, while allowing for efficient training with lower computational cost than many existing deep architectures. We provide theoretical guarantees by developing consistency results on the distance between the true and synthetic edge distributions. Empirical studies further demonstrate the effectiveness of SyNGLER, which efficiently produces networks that better preserve key network characteristics such as network moments and degree distributions compared with existing approaches. Code is available at https://github.com/FeifanJiang/syngler.

Authors:Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu, Torsten Hoefler
Title: Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling
Abstract:
Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; however, we find that in large language model training, DST can suffer from optimization instability, manifested as loss spikes after topology updates. In this work, we show that the naive use of standard Adam‑based optimizers leads to a cold‑start issue for newly regrown parameters, resulting in excessively large updates and disrupted training dynamics. To address this issue, we propose Sparse Memory‑Efficient Training (SMET), which stabilizes DST with optimizer warm‑up and improves training progress through density‑aware learning‑rate scaling. SMET further reduces memory consumption by storing gradients and optimizer states only for active parameters. We provide a theoretical analysis of the update behaviors under SMET, showing improved optimization stability. Extensive experiments demonstrate that SMET enables stable, scalable, and memory‑efficient sparse pre‑training of LLMs, paving the way for sparse training as a practical alternative to dense training. Our code is publicly available at: https://github.com/QiaoXiao7282/SMET.

Authors:Weitao Li, Hao Zhou, Xuanyu Lei, Fandong Meng, Yuanhang Liu, Jingyi Ren, Ante Wang, Xiaolong Wang, Yuanchi Zhang, Fuwen Luo, Guangwen Yang, Lin Gan, Weizhi Ma, Yang Liu
Title: Enhancing LLM Metacognition via Cognitive Pairwise Training
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome‑level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid‑training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a reasoning‑quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning‑‑metacognition trade‑off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math‑average points and +5.2 abstention‑F1 points. Further analyses show that CPT improves trace quality and exhibits strong robustness and scalability across evaluation and training settings. Code and models are released at https://github.com/Tsinghua‑dhy/CPT.

Authors:Ziling Lu, Zongsheng Li, Xinke Shen, Kexin Lou, Yingyue Xin, Xiaoqi Chen, Shinan Wang, Xiang Chen, Jiahao Fan, Chenyu Huang, Xin Xu, Zhoujie Hou, Chen Wei, Quanying Liu
Title: OmniEEG-Bench: A Standardized Evaluation Benchmark for EEG Foundation Models
Abstract:
Electroencephalography (EEG) supports a variety of brain‑computer interface (BCI) tasks ranging from brain‑state monitoring to human‑LLM interactions. EEG foundation models are emerging, but evaluation remains fragmented due to heterogeneous datasets and nconsistent task protocols. Here, we introduce OmniEEG‑Bench, a unified benchmark and downstream task roadmap for EEG foundation models (FMs). It organizes evaluation of EEG FMs into six task families spanning (i) signal reliability, (ii) biometrics and disease, (iii) consciousness and state, (iv) cognition and emotion, (v) naturalistic stimulus decoding, and (vi) motor and interaction, introducing a new generation of tasks not systematically benchmarked in prior EEG FM work. OmniEEG‑Bench standardizes model deployment, task definitions, and metrics through a task‑card specification, and unifies 54 EEG datasets with consistent evaluation protocols. We benchmark 10 representative EEG foundation models and report a leaderboard that covers diverse evaluation settings. Both pretraining dataset diversity and model size are significantly associated with better average ranks across datasets, revealing scaling‑law behavior in EEG foundation models (Figure 1). These results suggest that scaling EEG foundation models requires not only larger architectures but also broader and more diverse pretraining data. The benchmark code is available at https://github.com/ncclab‑sustech/omni‑eegbench.git.

Authors:Subhadip Mitra
Title: Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs
Abstract:
Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B‑31B) with quality‑diversity evolution (MAP‑Elites) as an automated red‑teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/‑ 5.7% attack success rate (ASR; mean +/‑ std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/‑ 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/‑ 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44‑46% but only 14‑18% to Gemma 4, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near‑100% across all generations, though a second‑judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self‑hosted judge; code and artifacts are available at https://github.com/bassrehab/red‑queen.

Authors:Subhadip Mitra
Title: Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
Abstract:
Current approaches to LLM adversarial testing suffer from coverage gaps: manual red‑teaming does not scale, LLM‑as‑attacker methods exhibit mode collapse, and gradient‑based approaches produce uninterpretable gibberish. We introduce a quality‑diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP‑Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT‑4o‑mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open‑weight coding model (Devstral‑small‑2), we discover distinct vulnerability profiles: GPT‑4o‑mini is vulnerable to hypothetical and multi‑turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi‑turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model‑specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red‑queen.

Authors:Shrimon Mukherjee, Kishalay Das, Partha Basuchowdhuri, Pawan Goyal, Niloy Ganguly
Title: Latent Diffusion Pretraining for Crystal Property Prediction
Abstract:
Fast and accurate prediction of crystal properties is a central challenge in new materials design. Graph neural networks and Transformer‑based models have emerged as powerful tools for this task due to their ability to encode the local structural environment of atoms within a crystal. However, these models are data‑hungry, and in practice, labeled data for crystal properties are scarce. Pretraining‑finetuning strategies, particularly those based on diffusion models, have shown promise in addressing these limitations. In this work, we introduce a novel latent diffusion based pretraining framework, CrysLDNet, designed to mitigate data scarcity. Our approach integrates a Variational Autoencoder (VAE) with a diffusion model during the pretraining stage. The VAE encoder maps 3D crystal structures into a smooth latent space within which the diffusion process is applied. This latent diffusion pretraining enables the graph encoder to effectively capture structural and chemical semantics from large‑scale unlabeled data, which can then be finetuned for specific property prediction tasks. Comprehensive experiments on popular DFT datasets for property prediction reveal that CrysLDNet significantly outperforms both training‑from‑scratch and pretrained baselines, with improvements of 4.26% and 4.90% on the JARVIS and MP datasets, respectively. Additionally, the learned representations remain robust in sparse‑data conditions and are expressive enough to correct DFT errors when finetuned with limited experimental data. Code is available at: https://github.com/shrimonmuke0202/CrysLDNet.git.

Authors:Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh
Title: Confidence-Adaptive SwiGLU for Mixture-of-Experts
Abstract:
SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness ‑‑ the smoothness and selectivity of the gating function ‑‑ is typically fixed throughout training. In this work, we propose Confidence‑Aware SwiGLU (κ‑SwiGLU), a variant of SwiGLU for Mixture‑of‑Experts (MoE) models that adjusts expert gate sharpness according to token‑level routing confidence. Specifically, κ‑SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ‑SwiGLU on the FineWeb‑Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ‑SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence‑aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa‑swiglu.

Authors:Mazdak Teymourian, Ramtin Moslemi, Farzan Rahmani, Mohammad Hossein Rohban
Title: SORA: Free Second-Order Attacks in Fast Adversarial Training
Abstract:
Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single‑step variants, where robustness to multi‑step attacks collapses despite high single‑step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step‑size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state‑of‑the‑art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.

Authors:Sheng'en Li, Dongmian Zou
Title: COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs
Abstract:
Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF (Counterfactual Online Performative Fairness), a decision‑layer framework for deployment‑stable fairness monitoring and control in online link recommendation. COPF (i) defines group‑level opportunity gaps over exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph‑aware doubly robust (GA‑DR) estimators. We provide a noisy transfer theorem showing that Residual‑OI on estimated GA‑DR residuals implies bounds on exposure‑counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal‑dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst‑case spikes in exposure‑counterfactual group disparities with modest impact on ranking utility. Our code is available at https://github.com/lsnnnnnnnn/COPF.

Authors:Tianyang Xu, Tianci Liu, Niraj Rayamajhi, Ryan Patrick, Kranthi Varala, Ying Li, Jing Gao
Title: Prior-Guided Multi-Omic Transformers for Single-Cell Gene Regulatory Network Inference
Abstract:
Gene regulatory networks (GRNs) capture transcription factor‑target interactions and are central to understanding cell‑state regulation and disease. Reconstructing GRNs from paired single‑cell transcriptomic and chromatin accessibility data is promising but challenging: scATAC is extremely sparse, and most methods rely on fixed peak‑to‑gene links and weak supervision. We present EpiAwareNet, a prior‑guided multi‑omic Transformer framework that reconstructs GRNs from paired single‑cell data using only lightweight biological priors. In Stage 1, EpiAwareNet learns joint gene‑peak representations with a gene‑peak cross‑attention module, enabling data‑driven, gene‑specific aggregation of accessibility signals rather than hard‑coded peak‑to‑gene assignments. In Stage 2, EpiAwareNet incorporates a bulk‑derived GRN prior as noisy positive edges to provide weak supervision under label scarcity, refining regulatory scores while remaining robust to prior noise. In our experiments, EpiAwareNet improves GRN reconstruction over representative single‑ and multi‑omic baselines and yields GRNs with greater biological plausibility, such as improved recovery of known regulatory interactions, suggesting that lightweight biological priors from bulk data can effectively guide single‑cell GRN inference when combined with adaptive cross‑modal representation learning. Code and data will be available at https://github.com/tianyang‑x/EpiAwareNet_pub.

Authors:Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei
Title: MESA: Improving MoE Safety Alignment via Decentralized Expertise
Abstract:
Mixture‑of‑Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE‑based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost‑effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.

Authors:Oluwaleke Yusuf, Adil Rasheed, Frank Lindseth
Title: Spatiotemporal Multi-Task Graph Transformer for Trip-Level Transit Prediction
Abstract:
Passenger count data from public transit systems reveals urban mobility patterns and is essential for planning, operation, and optimisation. However, non‑linear spatiotemporal interdependencies across stops and lines make modelling and prediction challenging. Existing approaches often rely on fixed temporal, spatial, or stop‑level formulations, limiting their ability to capture within‑trip evolution and network context. This study proposes SMT‑GraphFormer, a spatiotemporal multi‑task graph transformer that frames trip‑level transit prediction as sequence‑to‑sequence modelling. Given a line's stop sequence and trip‑level context, the model predicts successive boarding and alighting counts, with delay and dwell time treated as encoder‑side surrogate tasks. Key components include graph embeddings for multi‑relational stop similarity, a context encoder for weather and temporal information, and a multi‑gate mixture‑of‑experts module that produces task‑specific decoder representations for boarding and alighting predictions. Evaluation on public bus transit data from Trondheim, Norway, shows that SMT‑GraphFormer outperforms stop‑level tabular benchmarks, with ablation studies examining each component's contribution. The sequential formulation yields substantial gains on alighting prediction (+0.24 in R^2) and consistent improvements on boarding, delay, and dwell, confirming the value of explicit trip‑level sequential bias and inter‑target dependencies. These findings demonstrate the potential of transformer‑based sequence modelling for capturing complex spatiotemporal dynamics in public transit and underscore the value of architectures tailored to transit data rather than off‑the‑shelf tabular models. The proposed framework provides a horizon‑agnostic basis for scenario analysis in digital twin environments, supporting informed decision‑making by planners and transit operators.

Authors:Yuan Yao, Jin Song, Huixia Li, Tongtong Yuan, Jiaqi Wu, Yu Zhang
Title: Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain
Abstract:
Transfer learning aims to facilitate the learning of a target domain by transferring knowledge from a source domain. The source domain typically contains semantically meaningful samples (e.g., images) to facilitate effective knowledge transfer. However, a recent study observes that the noise domain constructed from simple distributions (e.g., Gaussian distributions) can serve as a surrogate source domain in the semi‑supervised setting, where only a small proportion of target samples are labeled while most remain unlabeled. Based on this surprising observation, we formulate a novel problem termed Semi‑Supervised Noise Adaptation (SSNA), which aims to leverage a synthetic noise domain to improve the generalization of the target domain. To address this problem, we first establish a generalization bound characterizing the effect of the noise domain on generalization, based on which we propose a Noise Adaptation Framework (NAF). Extensive experiments demonstrate that NAF effectively leverages the noise domain to tighten the generalization bound of the target domain, leading to improved performance. The codes are available at https://github.com/AIResearch‑Group/SSNA.

Authors:Zining Liu, Yunhai Hu, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang
Title: DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation
Abstract:
Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs) however, its application to vision‑language models (VLMs) remains relatively unexplored. We propose~DREAM‑S, a novel SD framework designed specifically for fast and efficient decoding in VLMs. DREAM‑S leverages a neural architecture search (NAS) framework with target‑aware supernet training to automatically identify both the optimal interaction strategy between the draft and target models, and the most suitable draft model architecture for the underlying hardware implementation platform. DREAM‑S additionally incorporates adaptive intermediate feature distillation, guided by attention entropy, to enable efficient draft training. Experiments on a range of well‑established VLMs show that DREAM‑S achieves up to a 3.85× speedup compared to standard decoding approaches and significantly outperforms existing SD baselines. The code is publicly available at: https://github.com/SAI‑Lab‑NYU/DREAM‑S .

Authors:Hugues Van Assel, Edward De Brouwer, Saeed Saremi, Gabriele Scalia, Aviv Regev
Title: Generate in Reconstruction Space, Match in Semantic Space: Transport Geometry for One-Step Generation
Abstract:
Generative modeling and self‑supervised representation learning (SSL) optimize structurally different objectives: generative training rewards distributional fidelity, while SSL rewards semantic coherence. Yet recent work repeatedly finds that SSL features improve generative training, though the mechanism of this synergy remains unclear. Here, we study the benefits of SSL in generative modeling in the framework of one‑step generation where the role of representation is explicit: frozen SSL features are used to match generated samples to real data. We use the Sinkhorn divergence in that feature space, providing a tractable surrogate for the Wasserstein distance, the population‑level discrepancy approximated by Fréchet‑style evaluation metrics (such as FID). We find that this objective becomes highly effective when computed in a semantically structured SSL feature space (a 39× reduction in ImageNet FID). We trace this behavior primarily to matching estimation: semantic SSL features that suppress nuisance reconstruction details induce a more compact geometry, making distribution matching more tractable. As a consequence, the best training SSL features need not match the features used by the evaluation metric. In particular, we show that using Inception as the feature extractor can improve FID while degrading matching stability and sample quality, revealing a form of metric hacking. Using extensive experiments on ImageNet, we identify which SSL feature families lead to best generation performance and show that matching stability is a quantitative criterion for selecting them. Code is available at https://github.com/Genentech/semantic‑transport‑generation.

Authors:Wenya Yu, Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah
Title: ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression
Abstract:
Post‑Training Quantization (PTQ) and Low‑Rank Adaptation (LoRA) constitute the standard pipeline for efficient Large Language Model (LLM) deployment. However, applying them sequentially poses a problem: PTQ often leaves behind random noise that is spread out (across the model's weights) in a way LoRA can't easily fix, meaning that LoRA ends up wasting its limited capacity trying to fix uncorrectable noise instead of improving task performance. In this paper, we propose ProjQ, a novel framework for constraining quantization noise to the low‑rank manifold via orthogonal subspace projection. We derive an efficient alternating algorithm that shapes the quantization noise into a low‑rank structure, effectively offloading dominant error components to the subsequent adapter while minimizing the residual error in the orthogonal "uncorrectable" subspace. Our theoretical analysis demonstrates that ProjQ preserves strictly greater model plasticity for downstream tasks compared to standard PTQ. Extensive experiments on LLaMA‑2, Qwen2.5 and Qwen3 confirm that ProjQ consistently outperforms existing methods in both quantization error compensation and downstream task fine‑tuning, achieving up to 2× lower evaluation loss for compensation and matching the performance of standard 4‑bit baselines on language modeling tasks with only 3 bits. The code is available on https://github.com/yy9301/ProjQ .

Authors:Artem Artemev, Rui Xia, Benjamin M. Boyd, Youjing Yu, Felix Dangel, Guillaume Hennequin, Alberto Bernacchia
Title: Exploiting weight-space symmetries for approximating curvature
Abstract:
Many machine learning techniques rely on approximating a loss function's curvature, but this is notoriously hard to do at the scale of modern deep networks. Surprisingly, no previous work has exploited the curvature constraints that arise from well known weight‑space symmetries in loss landscapes. By analytically averaging over group actions that leave the loss invariant, we construct structured Hessian approximations from single gradients that can be tractably estimated, stored, and inverted. The choice of user‑specified symmetry group directly governs the trade‑off between approximation accuracy and computational cost. Moreover, our framework provides a unifying theoretical lens for viewing existing methods; in particular, a specific choice of symmetry group recovers Shampoo/Muon‑like curvature estimates. We validate our method on a range of network architectures, and deploy it to second‑order optimization benchmarks, including a small language model. Our curvature estimation framework might find applications in other machine learning problems such as uncertainty estimation, continual learning, compression/pruning, training data attribution, and more.

Authors:Tianyu Pang, Vignesh Kothapalli, Shenyang Deng, Haohui Wang, Dawei Zhou, Yaoqing Yang
Title: Balancing Learning Rates Across Layers: Exact Two-Step Dynamics and Optimal Scaling in Linear Neural Networks
Abstract:
We study optimal learning‑rate selection in two‑layer and three‑layer linear neural networks trained to learn linear target functions. In particular, we derive the exact closed‑form expressions for the gradients and test loss after one and two steps of gradient descent, enabling a precise characterization of early training dynamics. We characterize how learning rates should scale under the gradient approximation in the first two steps, and prove that performing updates with this approximation yields a tractable surrogate loss with a tight, small approximation error. This formulation enables the theoretical analysis of layer‑wise learning rates and reveals a distinct early‑training regime: test loss can be minimized by unequal learning rates at the initial step, while equal learning rates become optimal in subsequent steps. Our numerical experiments validate the theory and demonstrate the importance of balancing layer‑wise learning rates early during training. The code is available at: https://github.com/TDCSZ327/Layer‑Balancing.

Authors:David Mullett
Title: Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control
Abstract:
Recursive systems can enter collapse‑like regimes ‑‑ self‑reinforcing amplification, persistent recursion, and narrowing diversity that mask accelerating internal degradation ‑‑ before overt failure becomes visible. We introduce Loopzero, a claim‑bounded benchmark framework for testing whether recursive failures follow a directional telemetry pattern: rising gain (G), recursive persistence (p), and declining diversity (δ). The claim boundary is specified in Lean; the Lean artifact does not verify real telemetry, benchmark validity, or detector performance. We evaluate the bridge on two frozen public‑artifact benchmarks: a segmented public‑markets benchmark (Volmageddon 2018, COVID MWCB 2020) and a MovieLens‑25M offline deterministic recommender replay. Detectors are evaluated under a locked equal‑false‑positive contract (FP \in [0.03, 0.07], pre‑registered) so all configurations face the same alert budget. Neither tested standard comparators nor Loopzero's pre‑registered quantile detector achieved an accepted operating point. Directional witness alignment held on both canonical benchmarks, with adjacent‑horizon and row‑level limitations disclosed. Digitized Shumailov et al. (2024) LLM training‑loop trajectories are directionally consistent with the pattern; matched‑FP evaluation in that domain is deferred. The contribution is a reproducible, falsifiable benchmark framework for evaluating recursive‑collapse warning claims under an explicit alert‑budget contract ‑‑ non‑acceptance reported as a first‑class scientific outcome.

Authors:Pau Montagut Bofi, Mario García Blasco, Tessa Pulli, Markus Vincze
Title: Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation
Abstract:
Fine‑tuning Vision‑Language‑Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy‑to‑predict joints can mask joints that still fail. We fine‑tune SmolVLA (450M, action‑expert only) on the 11‑DoF Toyota HSR and compare it against π_0.5 (3.3B), a stronger pretrained baseline. Per‑group analysis exposes two patterns: in SmolVLA, the mobile base converges slowest and limits overall performance. In expert‑only fine‑tuning of π_0.5 (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real‑robot trials (20 per model), π_0.5 80k (4.0/4) significantly outperforms both fine‑tuned variants (expert‑only 3k: 3.75/4; HSR‑SmolVLA: 3.5/4; Mann‑Whitney p \leq 0.010), despite expert‑only 3k having the lowest total MSE. This separation is most consistent with the offline arm‑group error, not total MSE or base‑group error. We conclude that per‑group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: https://github.com/paumontagut/per‑group‑mse‑vla

Authors:Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin
Title: From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets
Abstract:
Standard machine learning pipelines often admit many near‑optimal models. These "Rashomon sets" pose a range of challenges and opportunities for uncertainty‑aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real‑world datasets. Code for PRAXIS is available at https://github.com/zakk‑h/PRAXIS

Authors:Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li
Title: BAGEN: Are LLM Agents Budget-Aware?
Abstract:
While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget‑Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget‑awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout‑replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget‑awareness, with correlation r=0.35. (2) frontier models are consistently over‑optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget‑aware signal is actionable and trainable. Early stop saves 28‑64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen‑ai.github.io/bagen/

Authors:Leon Pohl, Lukas Beer, George Sebastian, Mirko Maehlisch
Title: Modeling Robotics Dataset Construction as an Artifact-Based Build Process
Abstract:
Robotic systems generate large volumes of multimodal sensor data, but converting ROS bag recordings into machine learning datasets is often handled by ad hoc sequential scripts, creating engineering overhead and slow iteration cycles. We model dataset construction as an artifact‑based build process over a dependency graph and implement this approach in Bagzel, an open‑source Bazel extension for reproducible, incremental dataset generation (including nuScenes‑format export). We compare Bagzel and Bagzel‑xattr (server‑side digest management) against a sequential rosbag2nuscenes baseline. Bagzel reduces runtime in all evaluated execution modes, with the largest gains in iterative workflows (up to 386.26x in warm builds and 7.21x in incremental builds on a 20.4 GB dataset). Across dataset sizes from 5.1 to 20.4 GB, Bagzel variants show markedly better scaling behavior than the baseline, especially in warm and incremental modes. Bagzel‑xattr provides additional gains, with a mean runtime reduction of 5.9% compared to Bagzel in the input granularity study. Overall, modeling robotics dataset construction as an artifact‑based build process substantially reduces dataset update latency while maintaining a deterministic build design that supports reproducibility. Bagzel is publicly available at https://github.com/UniBwTAS/bagzel.

Authors:Ambreen Aslam, Maaz Hassan, Bibi Zahra, Muhammad Khuram Shahzad
Title: XAI-SOH-FL: Enhancing SOH-FL with Adaptive Aggregation and Explainable AI for Intrusion Detection in Heterogeneous IoT
Abstract:
Intrusion Detection Systems (IDS) in Internet of Things (IoT) environments face significant challenges due to data heterogeneity, lack of labeled data, and limited model interpretability. Federated Learning (FL) offers a privacy‑preserving solution; however, existing approaches such as SOH‑FL suffer from two key limitations: reliance on a manually tuned aggregation parameter γ and lack of explainability in model predictions. In this paper, we propose XAI‑SOH‑FL, an enhanced framework that integrates adaptive aggregation and explainable artificial intelligence into the SOH‑FL paradigm. First, we introduce a dynamic γ selection mechanism based on similarity thresholding, enabling the aggregation process to adapt to evolving data distributions. Second, Bayesian Optimization is employed to automatically determine optimal γ values, eliminating the need for manual tuning. Third, SHAP (SHapley Additive exPlanations) is incorporated to provide feature‑level interpretability for intrusion detection decisions. Experimental evaluation on the CICIDS2017 dataset demonstrates that the proposed approach achieves an accuracy of 94.12% and an F1‑score of 0.92, outperforming the baseline SOH‑FL model while converging in fewer communication rounds. Furthermore, SHAP‑based analysis reveals that flow‑level features such as Flow Duration and Packet Length significantly influence model predictions. These results indicate that XAI‑SOH‑FL provides an effective balance between accuracy, adaptability, and interpretability in heterogeneous IoT environments.

Authors:Yuanyuan Wang, Wenjie Wang, Kun Zhang, Mingming Gong
Title: Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions
Abstract:
Bridging the gap between visual realism and physical understanding is a core challenge for video‑based world models. We study the structural identifiability of continuous‑time physical laws from raw pixels, focusing on whether an encoder‑only pipeline can uniquely recover the parameters of second‑order linear ODEs. We prove that a level‑set slope‑coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance‑floor regularizer to stabilize the decoder‑free objective and prevent latent collapse. Validated on synthetic and real‑world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute‑intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at https://github.com/wenjiewang3/PhysicsFromVideo.

Authors:Michel Dione, Jerry Lonlac, Hélène Louis, Anthony Fleury, Stephane Lecoeuche
Title: DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions
Abstract:
Distributed Acoustic Sensing (DAS) enables large‑scale monitoring through optical fibers, but its high dimensionality and complex spatio‑temporal patterns make event classification demanding. Existing deep learning approaches‑CNNs, recurrent models, and Transformer variants‑either fail to capture long‑range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA‑selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step‑wise and channel‑wise attention branches, fused by an adaptive gating mechanism. Experiments on the open Φ‑OTDR benchmark and a real‑scenario DAS dataset show that DAS‑tatFormer achieves up to 99.4% accuracy and near‑perfect real‑world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real‑time DAS‑based monitoring. We release our code at https://github.com/MichelD‑git/DAStatFormer

Authors:Jiayu Zhao, Zihan Teng, Minhao Fan, Tianrui Ma, Wentao Ren, Song Chen, Weichen Liu
Title: BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization
Abstract:
Mixture‑of‑Experts (MoE) large language models reduce per‑token computation through sparse expert activation, but their deployment remains memory‑intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra‑low‑bit regime: pruning irreversibly removes model capacity, while coarse‑grained quantization fails to allocate bits according to heterogeneous expert and weight‑direction importance. We propose BitsMoE, a spectral‑energy‑guided bit‑allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert‑specific spectral factors, retaining the shared basis without quantization to preserve common cross‑expert structure and using the expert‑specific factors as fine‑grained quantization units. To determine the bit‑width of each unit, BitsMoE formulates spectrum‑wise mixed‑precision quantization as an activation‑aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra‑low‑bit regimes. Under 2‑bit quantization on Qwen3‑30B‑A3B‑Base, BitsMoE accelerates quantization by 12.3×, improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76× over GPTQ. Our model and code are publicly available at https://github.com/zjiayu064/BitsMoE.

Authors:Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang
Title: TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models
Abstract:
The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto‑regressive competitors in language processing. However, their flexible, any‑order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM‑Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU‑ML/TrustLDM.

Authors:Alireza Kheirandish, Jihoon Hong, Sara Fridovich-Keil
Title: KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems
Abstract:
Diffusion models have shown promising performance as data‑driven priors for computational imaging, as well as some capacity to detect out‑of‑distribution (OOD) images. However, existing approaches to OOD detection often require some knowledge of the shifted distribution, fail to detect subtle or localized distribution shifts, and operate on full images, rather than the indirect measurements available in inverse problems. We propose an OOD detection metric based on the Kullback‑Leibler divergence between the diffusion prior and the posterior distribution, that (i) does not require any calibration data or knowledge of the shifted distribution, and (ii) can detect whole images as OOD as well as localize OOD patches within an image. Experimentally, we show that this metric can detect subtle yet semantically meaningful distribution shifts, such as the shift from healthy liver CT scans to those with tumors, and generalizes across different types of diffusion models, datasets, and inverse problems. Our code can be found at https://github.com/voilalab/KLIP.

Authors:Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
Title: LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
Abstract:
Long‑context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low‑confusability distractors and sparse, outcome‑only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce \textscLongTraceRL. For data construction, we generate multi‑hop questions via knowledge graph random walks and leverage search agent trajectories to build \emphtiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one‑shot search. For reward design, we propose a \emphrubric reward that uses the gold entities along each reasoning chain as fine‑grained, entity‑level process supervision. This rubric reward is applied only to responses with correct final answers (positive‑only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B‑‑30B) across five long‑context benchmarks demonstrate that \textscLongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence‑grounded reasoning. Codes, datasets and models are available at \hrefhttps://github.com/THU‑KEG/LongTraceRLhttps://github.com/THU‑KEG/LongTraceRL.

Authors:Jiefang Xiao, Maolin Gao, Simon Weber, Guandao Yang, Daniel Cremers
Title: Functional Attention: From Pairwise Affinities to Functional Correspondences
Abstract:
Learning mappings between infinite‑dimensional function spaces, or operator learning, is essential for many machine learning applications. Although transformer‑based operators are popular, they often rely on token‑wise attention. These methods treat continuous fields as discrete tokens and usually ignore the global functional structure. We introduce \emphFunctional Attention, which reinterprets attention as a functional correspondence between adaptive bases. Inspired by geometric functional maps, our method replaces softmax affinities with structured linear operators. This yields a compact, generalizable, resolution‑invariant representation that explicitly captures global dependencies. Experiments demonstrate that \emphFunctional Attention can match state‑of‑the‑art performance in many operator learning tasks, including solving PDEs, 3D segmentation, and regression, while remaining robust to varying discretizations. Project page is available at https://github.com/xjffff/FUNCATTN.

Authors:Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer
Title: RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
Abstract:
Self‑supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard‑to‑predict scaling behavior of multi‑network system designs. We introduce RayDer, a unified, feed‑forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self‑supervised NVS into a well‑posed single‑model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time‑varying content and enables stable training on unconstrained real‑world video. Importantly, RayDer keeps static‑scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic‑scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power‑law scaling with data and compute, and outperforms static‑scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero‑shot open‑set performance competitive with state‑of‑the‑art supervised approaches. Project Page: https://compvis.github.io/rayder

Authors:Arnas Uselis, Darina Koishigarina, Seong Joon Oh
Title: How can embedding models bind concepts?
Abstract:
Humans easily determine which color belongs to which shape in multi‑object scenes, an ability known as concept binding. Vision‑language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag‑of‑concepts model in cross‑modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni‑modal probes can recover object information. However, CLIP's binding function is high‑complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low‑complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at https://github.com/oshapio/binding‑concepts‑complexity.

Authors:Zaid Khan, Justin Chih-Yao Chen, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Title: GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization
Abstract:
GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground‑truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM‑driven searches scale to large search budgets, on‑device evaluation becomes a bottleneck. To address this, we study how LLMs can serve as selective GPU surrogates for kernel evaluation, by forecasting the performance of proposed kernels. A useful surrogate should be accurate, and it should be selective, by knowing when it could be wrong, and deferring to the GPU. To evaluate surrogates, we measure whether their forecasts are accurate, calibrated, and practically useful for recovering fast kernels under limited GPU‑measurement budgets. Next, we study whether reinforcement learning can improve forecast accuracy and confidence calibration. Our experiments demonstrate that LLMs can accurately forecast relative kernel performance, that their utility can be improved through reinforcement learning. Used inside a kernel search, the surrogate lets the search consider several times as many candidates under the same GPU evaluation budget, and that leads to finding faster kernels than an equal‑budget baseline. These results suggest that LLMs can play a broader role in kernel optimization, by acting as virtual models of a GPU rather than solely as kernel generators for search.

Authors:Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu
Title: DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
Abstract:
Large language models are increasingly deployed in multi‑turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning is able to effectively address multi‑turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine‑tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we novelly propose DRIFT (Decoupled Rollouts and Importance‑Weighted Fine‑Tuning), a framework that operationalizes the theoretical insight that the KL‑regularized RL objective is equivalent to importance‑weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return‑based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi‑turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine‑tuning. Code is available at https://github.com/2020‑qqtcg/DRIFT.

Authors:Grégoire Martinon, Ibrahim Merad, Mohammed Raki
Title: Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation
Abstract:
Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM‑as‑judge proxies. Prediction‑powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open‑source Python library that unifies state‑of‑the‑art PPI estimators (PPI++, Stratified PPI, Predict‑Then‑Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost‑optimal) under a scipy‑style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide

Authors:Nan Bao, Yifan Zhao, Wenzhuang Wang, Jia Li
Title: Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation
Abstract:
The layout‑to‑image (L2I) task enables fine‑grained control over image generation via object categories and spatial layouts. However, existing L2I methods yield fragmented and distorted generations under few‑shot atypical settings. We term this failure as representation fragmentation, arising from a granularity mismatch that entangles semantic identity with visual details. To address this issue, we propose a representation‑driven framework that disentangles semantics from primitives for robust few‑shot adaptation. Specifically, Semantic Anchoring aggregates categorical semantics into anchors for stable identity, while Primitive Imbuing models recomposable primitives for robust local detail modeling. Conceptual Steering further regulates optimization with a saliency‑aware objective to preserve foreground semantic consistency. Extensive experiments demonstrate consistent improvements in the 5‑shot regime over state‑of‑the‑art L2I methods in both visual fidelity and alignment across diverse atypical domains. The source code is publicly available at https://github.com/iCVTEAM/DSP.

Authors:Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song, Wei Xue, Yike Guo
Title: Scaling Multi-Hop Training Data via Graph-Constrained Path Selection
Abstract:
Endowing large language models with compositional reasoning over specialized documents requires multi‑hop training data at scale, where such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question‑answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross‑referencing clauses, conditions that characterize most real‑world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre‑validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram‑matrix arguments establishing that local similarity bounds alone admit endpoint drift up to ~91^\circ, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched‑size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4× expansion of the usable corpus rather than from higher per‑chain quality ‑‑ reframing the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine‑tuning Qwen3‑32B on 80K examples constructed from the CUAD legal contract corpus improves closed‑book Token F1 from 21.66% to 38.58%. We have released our codes at https://github.com/hkgai‑official/GCSCS.

Authors:Ei Hmue Khine, Yao Li, Jiebao Sun, Shengzhu Shi, Zhichang Guo, Boying Wu
Title: Latent Geometric Chords for Query-Efficient Decision-Based Adversarial Attacks
Abstract:
While decision‑based black‑box adversarial attacks present a severe security threat, current methodologies suffer from fundamental limitations. Pixel‑wise attacks frequently introduce unnatural, high‑frequency visual artifacts, while latent‑space frameworks are confined by the limited search space of low‑dimensional manifolds and inherent reconstruction flaws. To resolve these limitations, we propose Latent Geometric Chords (LGC) for Query‑Efficient Decision‑Based Adversarial Attacks alongside a variant, LGC‑H. At its core, LGC navigates decision boundaries by executing a curvature‑aware geometric search within a compressed semantic manifold. To guarantee high visual fidelity and circumvent dimensionality bottlenecks, we introduce a Residual‑based Adversarial Generation (RAG) mechanism. RAG isolates semantic perturbations as geometric chords and superimposes them directly onto the original source image. RAG substantially resolves baseline reconstruction flaws and effectively doubles the permissible search space dimensions. Experimental results demonstrate that LGC achieves robust cross‑dataset transferability and substantially outperforms state‑of‑the‑art baselines. Notably, our method, LGC, minimizes perturbation magnitudes while achieving state‑of‑the‑art visual fidelity‑‑with a Structural Similarity Index Measure (SSIM) exceeding 0.99 and a Learned Perceptual Image Patch Similarity (LPIPS) below 0.01 at 5000 queries‑‑and sustaining high attack success rates under stringent perceptual constraints, successfully compromising adversarially trained robust models. The source code is available at: https://github.com/eihmuekhine/Latent‑Geometric‑Chords.

Authors:Umut Onur Yasar
Title: Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10
Abstract:
We investigate how teacher‑student capacity relationships modulate knowledge distillation (KD) effectiveness in ResNet‑based image classification on CIFAR‑10. Across three teacher‑student pairs ‑‑ R50‑>R18, R34‑>R18, and R50‑>R34 ‑‑ we compare Logit‑KD and Feature‑KD under controlled, reproducible conditions (3 seeds, mean+/‑std reported throughout). We report three main findings. First, student capacity is a key moderating factor in distillation gain: R34 students benefit substantially more from KD than R18 students even when teacher‑student accuracy gaps are comparable, with the strongest gain of +0.30pp observed for R50‑>R34 Feature‑KD versus +0.18pp for R34‑>R18 Feature‑KD and +0.00pp for R34‑>R18 Logit‑KD. Second, implementation correctness critically affects Feature‑KD: a gradient clipping bug that excluded projection layers suppressed Feature‑KD performance and produced misleading comparisons with Logit‑KD. After correction, Feature‑KD matches or outperforms Logit‑KD in two of three pairs, reaching 95.55% on R50‑>R34 against a baseline of 95.25%. Third, input‑resolution‑aware architecture is a prerequisite for effective distillation: correcting the ResNet stem for 32x32 inputs raises teacher accuracy by over 5pp ‑‑ an order of magnitude larger than any KD gain. All code and results are available at github.com/umutonuryasar/kd‑capacity‑gap.

Authors:Miltiadis Stouras, Vincent Cohen-Addad, Silvio Lattanzi, Ola Svensson
Title: Retriever Portfolios: A Principled Approach to Adaptive RAG
Abstract:
Retrieval‑augmented generation (RAG) systems typically rely on a single retriever and a single set of hyperparameters, despite facing highly heterogeneous queries that range from simple factoid questions to complex multi‑hop reasoning. We propose a method that automatically selects a small, diverse subset of retrievers (a portfolio) from a large pool of candidates, to cover different regions of the target query distribution. We formalize this setting via an expected best‑of‑k objective over the query distribution and show that it admits an efficient portfolio construction algorithm with near‑optimal guarantees. Across multiple QA benchmarks, our learned portfolios and router pipeline consistently outperform single‑retriever and naive multi‑retriever baselines on both retrieval metrics and answer quality. In addition, compared to inference‑time hyperparameter tuning approaches, fixed portfolios enable parallel retrieval and LLM calls, achieving comparable (and sometimes better) accuracy with substantially lower latency and token cost.

Authors:Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo
Title: Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models
Abstract:
Interactive video world models generate video chunk by chunk in response to user‑controlled camera movements, enabling applications such as real‑time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training‑free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory‑dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early‑step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware‑software co‑designed 3D block sparse attention with fused Triton kernels. Evaluated on HY‑WorldPlay and Matrix‑Game‑3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

Authors:Willian T. Lunardi, Samridha Shrestha, Martin Andreoni
Title: Learning Hyperspherical Time-Frequency Representations for Time-Series Out-of-Distribution Detection
Abstract:
Out‑of‑distribution (OOD) detection for time‑series data remains comparatively underexplored compared to vision and language, with a limited principled understanding of how supervised time‑series representations can be leveraged for reliable detection under distributional shifts. This work formulates time‑series OOD detection as representation learning with hyperspherical embeddings, where class‑conditional structure is induced by a von Mises‑Fisher (vMF) likelihood‑based objective on the unit sphere. The learned representation combines time‑ and frequency‑domain views of the input signal via domain‑specific encoders, integrating them into a joint embedding space for OOD detection. Detection uses distance‑based scores over the learned embeddings, including k‑nearest neighbors (k‑NN) and Mahalanobis scores. We evaluate the approach at scale on the complete UCR and UEA time‑series archives under a cross‑dataset protocol. Empirical results show consistent improvements under both k‑NN and Mahalanobis scoring over strong contrastive learning and post‑hoc baselines in the same setting. Code is available at https://github.com/tiiuae/hypertf‑time‑series‑ood.

Authors:Cheonwoo Lee, Dooho Lee, Doyun Choi, Jaemin Yoo
Title: Generalizing Multi-Scale Time-Series Modeling with a Single Operator
Abstract:
Multi‑scale modeling has emerged as an effective design principle for time‑series forecasting by capturing temporal dynamics at multiple resolutions. As no principled foundation has been established in the literature, we unify existing scaling methods into a scaling operator family, revealing a fundamental limitation of existing approaches: reliance on fixed and discrete scaling. To address this limitation, we propose SiGMA (Single Generalized Multi‑scale Architecture), which enables distance‑aware scaling via the learnable discrete Gaussian (LDG) kernel grounded in scale‑space theory. We evaluate SiGMA comprehensively on long‑ and short‑term forecasting benchmarks against state‑of‑the‑art multi‑scale baselines. SiGMA outperforms all competitors on both tasks, especially achieving the best performance in 13 out of 16 long‑term evaluation settings. Beyond accuracy, SiGMA significantly improves training speed by up to 5.3 times and reduces memory consumption by up to 3.8 times over the strongest competitors. Code is available at https://github.com/cheonwoolee/SiGMA.

Authors:Gael Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu
Title: LVSA: Training-Free Sparse Attention for Long Video Diffusion
Abstract:
Dense self‑attention is the compute and quality bottleneck of long‑video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near‑static output, that is, "frozen" repetitive video. State of the art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training‑free model‑agnostic block‑sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed‑grid bias which causes long‑range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which is otherwise out‑of‑memory on a single GPU. Moreover, LVSA provides speedups up to 2.41x compared to RIFLEx and 3.27x compared to UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality in a fair way, we introduce VQeval, a tool properly scoring loopy video failures, which instead are rewarded in state of the art evaluators like VBench‑Long. LVSA is quality‑neutral for generation at training horizon length and quality‑positive at extended lengths.

Authors:Jyotirmoy Singh, Anushka Roy, Shreea Bose, Chittaranjan Hota
Title: DEM: A Distilled Explanation Model for Interpretable Anomaly Detection in Physiological Sensor Networks
Abstract:
Anomaly detection in physiological sensor data from Wireless Body Area Networks (WBANs) can be caused by sensor faults, network disruptions, or missing data, leading to false alarms. Hence, it demands both high predictive accuracy and clinically interpretable explanations. Existing approaches rely either on black‑box models that achieve strong performance but offer no transparency, or on post‑prediction explanation methods such as SHAP and LIME. In this paper, we propose the Distilled Explanation Model (DEM), a three‑stage glass‑box framework that distills the non‑linear knowledge of a gradient boosting expert into an interpretable decision tree operating on residuals relative to a linear baseline, so that the explanation is not an approximation but the prediction itself. DEM introduces a novel distillation fidelity metric that quantifies how faithfully the explanation tree captures the expert model's non‑linear contribution, providing a principled measure of explanation trustworthiness absent from prior interpretable models. Evaluated across four physiological datasets, including MIMIC‑IV, WESAD, eICU, and an in‑house SmartNet WBAN corpus, DEM achieves an AUC of 0.9964 on clinical contextual anomaly detection and 0.9047 on wearable stress detection while producing human‑readable if‑then rules at a controllable depth. Inference requires 0.17ms per 1000 samples, rendering DEM 1235x faster than SHAP‑based post‑hoc explanation and suitable for real‑time physiological monitoring. Ablation studies confirm that the XGBoost distillation step provides measurable gains over naive residual fitting, and depth‑sensitivity analysis demonstrates an explicit, user‑controlled accuracy‑interpretability trade‑off unique to DEM among existing intrinsically interpretable models.

Authors:Giang Do, Hung Le, Truyen Tran
Title: Eigenvectors of Experts are Training-free Non-collapsing Routers
Abstract:
Sparse Mixture of Experts (SMoE) architectures improve the training efficiency of Large Language Models (LLMs) by routing input tokens to a selected subset of specialized experts. Despite their remarkable success, both training and inference in SMoE models suffer from the expert collapse issue (Chi et al., 2022), which degrades model performance. Prior studies primarily focus on improving the router; however, such methods rely on training from scratch or fine‑tuning, which requires high computational and data‑processing costs. Furthermore, we demonstrate that, despite these efforts, the issue persists when advancing well‑pretrained SMoE models, as evidenced by both theoretical and empirical results. To fill that gap, we analyze the advanced SMoE models and observe that the eigenvectors of expert weight matrices encode rich semantic information, pointing to an effective alternative to conventional routing strategies. Building on this insight, we propose Singular Value Decomposition SMoE (SSMoE), a novel and training‑free framework that leverages spectral properties of the expert weights to address the collapse issue and enhance model performance. Extensive experiments across diverse language and vision tasks, under both clean and corrupt data settings, demonstrate the strong generalization and robustness of SSMoE. Our findings highlight how a deeper understanding of model internals can guide the development of more effective SMoE architectures. Our implementation is publicly available at https://github.com/giangdip2410/SSMoE.

Authors:Junbin Qiu, Zhaowei Hong, Renzhe Xu, Yao Shu
Title: Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens
Abstract:
Accurate Zeroth‑Order (ZO) Hessian estimation is a cornerstone of derivative‑free methods, essential for tasks such as bilevel optimization, Bayesian inference, and uncertainty quantification. However, obtaining a complete suite of low‑variance estimators for the Hessian and its inverse in high‑dimensional settings remains a significant challenge. To address this, we propose a unified framework that reinterprets ZO Hessian approximation through the lens of single‑step Policy Optimization (PO). This perspective establishes a theoretical equivalence between general ZO Hessian estimators and the Hessian of a smoothed PO objective, unifying distinct classical randomized estimators as specific instances of baseline selection. Building on this foundation, we introduce ZoVH, a comprehensive suite of variance‑reduced estimators for the full Hessian matrix, its regularized inverse, and the bias‑corrected inverse Hessian‑gradient product. ZoVH leverages two key techniques: (1) a unique optimal baseline derived to provably minimize variance, and (2) a query reuse strategy that incorporates historical function queries to enhance sample efficiency without inflating costs. Our rigorous theoretical analysis confirms the unbiasedness of the Hessian estimator, validates the variance optimality of our baseline, provides error bounds for the entire ZoVH suite, and establishes convergence guarantees for the resulting curvature‑aware ZO algorithm. Extensive empirical results validate our theoretical findings, demonstrating that ZoVH achieves superior estimation accuracy and convergence performance in real‑world applications. Code is available at https://github.com/Qjbtiger/ZoVH

Authors:Shengyu Feng, Tarun Suresh, Yiming Yang
Title: Unsupervised Diffusion Solver for Combinatorial Optimization via Combinatorial Adjoint Matching
Abstract:
Diffusion‑based neural solvers have shown strong promise for combinatorial optimization (CO), but existing methods typically rely on supervised training with large collections of near‑optimal solutions. In this work, we extend adjoint‑based trajectory optimization methods to discrete combinatorial domains. We formulate diffusion‑based CO as a stochastic control problem over Continuous‑Time Markov Chains and introduce discrete adjoint dynamics for propagating optimization signals through discrete generative trajectories. Building on this formulation, we propose Combinatorial Adjoint Matching (CAM), an unsupervised training framework for discrete diffusion solvers with structured and low‑variance trajectory‑level optimization signals. Empirically, CAM consistently outperforms existing unsupervised diffusion baselines and achieves performance competitive with strong supervised diffusion solvers and even traditional solvers across diverse combinatorial optimization problems. Our code is available at https://github.com/Shengyu‑Feng/CAM.

Authors:Andreas Haupt, Justin Hartenstein, Anka Reuel, Mykel Kochenderfer, Sanmi Koyejo
Title: Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation
Abstract:
AI benchmarks have well‑documented limitations, with prior work examining contamination, saturation, and construct underspecification. Aggregation has received far less attention: benchmarks are typically summarized by uniformly averaging item‑level scores, implicitly treating every test item as equally valuable. We model benchmarking as a multitask principal‑agent game and show that the welfare loss from a benchmark is determined jointly by three item‑level primitives: alignment with normative welfare priorities, marginal improvability, and performance variance. We translate the theory into an audit framework that ranks items along each of these three axes, and apply it to OLMES items using WORKBank for welfare, the EvoLM 4B suite for improvability, and the PolyPythias 410M panel for variance. The framework surfaces items that are Pareto‑inferior within OLMES subject to a pro‑worker welfare operationalization. All code is available at https://github.com/stair‑lab/principal‑agent‑benchmarks.

Authors:Jun Tan, Qing Guo, Zicheng Xu, Jinglin Li, Qi Fang, Ning Gui
Title: Density-Guided Robust Counterfactual Explanations on Tabular Data under Model Multiplicity
Abstract:
Counterfactual explanations (CEs) are essential for actionable recourse, yet their reliability is often compromised in low‑density regions, where classifiers exhibit high variance. Unlike existing methods that rely on expensive ensemble intersections to define stability, we propose DensityFlow, a generative framework that constructs robust CEs by adhering to the high‑confidence data manifold. Specifically, we model the counterfactual generation as continuous‑time dynamics parameterized by Neural ODE, guided by a differentiable density score to actively avoid uncertain, low‑density areas. This density score is learned via Noise Contrastive Estimation, effectively leveraging a (K+1)‑way discriminator to estimate density ratios. For black‑box settings, we introduce a local proxy distillation mechanism that aligns a lightweight surrogate with the target model strictly within the trajectory of CE generation, enabling efficient gradient‑based optimization with minimal queries. Experiments demonstrate that DensityFlow achieves superior validity under model multiplicity while significantly reducing query costs compared to ensemble‑based baselines. Our implementation is available at https://github.com/G‑AILab/DensityFlow.

Authors:Fengyu Gao, Jing Yang
Title: Differentially Private Preference Data Synthesis for Large Language Model Alignment
Abstract:
Preference alignment is a crucial post‑training step for large language models (LLMs) to ensure their outputs align with human values. However, post‑training on real human preference data raises privacy concerns, as these datasets often contain sensitive user prompts and human judgments. To address this, we propose DPPrefSyn, a novel algorithm for generating differentially private (DP) synthetic preference data to enable privacy‑preserving preference alignment. DPPrefSyn is a principled framework grounded in the Bradley‑Terry preference model and the intrinsic geometric structure of pairwise human preference data. It first learns an underlying preference model from private data with formal differential privacy guarantees, and then leverages the learned model together with public prompts to synthesize high‑quality preference data. It exploits the shared linear structure of per‑cluster reward models to effectively capture heterogeneous human preferences in private datasets, and leverages DP Principal Component Analysis (DP‑PCA) to improve learning accuracy. Extensive experimental results demonstrate that DPPrefSyn achieves competitive alignment performance under strong DP guarantees. These findings highlight the potential of synthetic preference data as a practical alternative for privacy‑preserving preference alignment across a broad range of applications. To the best of our knowledge, this is the first work to generate DP synthetic preference data for LLM alignment. Our code is available at https://github.com/gfengyu/Differentially‑Private‑Preference‑Data‑Synthesis.

Authors:Yachen Gao, Xinwei Sun, Yikai Wang, Ye Shi, Jingya Wang, Jianfeng Feng, Yanwei Fu
Title: Conformal Reliability: A New Evaluation Metric for Conditional Generation
Abstract:
Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is still lacking. Existing metrics, which typically assess a single output, may fail to capture the variability or potential risks in generation. In this paper, we propose a novel evaluation metric called reliability score based on conformal prediction, which measures the worst‑case performance within the prediction set at a pre‑specified confidence level. However, computing this score is challenging due to the high‑dimensional nature of the output space and the nonconvexity of both the metric function and the prediction set. To efficiently compute this score, we introduce Conformal ReLiability (CReL), a framework that can (i) construct the prediction set with desired coverage; and (ii) accurately optimize the reliability score within the constructed prediction set. We provide theoretical results on coverage and demonstrate empirically that our method produces more informative prediction sets than existing approaches. Experiments on synthetic data and the image‑to‑text and text‑to‑image tasks further demonstrate the interpretability of our new metric, and the validity and effectiveness of our computational framework. Source code can be found at https://ggc29.github.io/CReL/.

Authors:Tianrun Yu, Kaixiang Zhao, Chih-Chun Chen, Amanda Hughes, Taylor W. Killian, Fenglong Ma, Weitong Zhang, Porter Jenkins
Title: LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation
Abstract:
We study trajectory selection for reasoning distillation, where teacher‑generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or model confidence, but they often overlook whether a trajectory is learnable by the student. In this paper, we present LARK, a learnability‑grounded method for reasoning trajectory selection. LARK selects trajectories that the student can learn efficiently while preserving the generalization of the full training distribution. At the core of LARK is a learnability factor ρ, which characterizes the rate at which the student's training loss decreases. To estimate this rate efficiently and maintain generalization, we introduce a learnability proxy and a χ^2‑regularized selection policy that balances learnability and distributional coverage, both with strong theoretical guarantees on their estimation error. Empirically, LARK consistently outperforms data selection baselines across multiple base models and reasoning tasks. Diagnostic analyses show that the LARK score predicts downstream training utility and that LARK‑selected trajectories induce faster supervised fine‑tuning loss reduction. Our code is available at https://github.com/Tianrun‑Yu/LARK.

Authors:Pierre-André Noël
Title: Destruction is a General Strategy to Learn Generation; Diffusion's Strength is to Take it Seriously; Exploration is the Future
Abstract:
I present diffusion models as part of a family of machine learning techniques that withhold information from a model's input and train it to guess the withheld information. I argue that diffusion's destroying approach to withholding is more flexible than typical hand‑crafted information withholding techniques, providing a rich training playground that could be advantageous in some settings, notably data‑scarce ones. I then address subtle issues that may arise when porting reinforcement learning techniques to the diffusion context, and wonder how such exploration problems could be addressed in more diffusion‑native ways. I do not have definitive answers, but I do point my fingers in directions I deem interesting. A tutorial follows this thesis, expanding on the destroy‑then‑generate perspective. A novel kind of probabilistic graphical models is introduced to facilitate the tutorial's exposition.

Authors:Yiming Xiao, Ankit Basu, Kai Yin, Sahil Vartak, Christian Swords, Ali Mostafavi
Title: DisasterLex: An Expert Concept-to-Schema Knowledge Graph for Geospatial Reasoning in Disaster Analytics
Abstract:
Disasters are inevitable and increasingly costly, and effective response depends on querying structured tabular data: precise, information‑dense records of hazard, exposure, vulnerability, and lifeline infrastructure that underpin disaster management. Current text‑to‑SQL methods enable natural‑language access to such tables but transfer poorly to the disaster domain, where queries span heterogeneous geospatial schemas and require reasoning over causal relations. We introduce DisasterLex, a knowledge‑graph‑mediated framework that inserts an Expert Knowledge Graph (EKG) of curated concepts and typed causal edges between the user query and the database, bridged to schema by concept‑to‑table links. The orchestration runs four stages (identifying query entities, routing to the operational domain, planning over causal edges, and grounding the SQL), restricting the schema passed to the model at each step. We instantiate it on a disaster‑analytics database (36 geospatial tables, 150 columns) with an EKG of 107 concepts, 117 causal edges, and 52 concept‑to‑schema links, evaluated on a 75‑query test set. On all seven base models spanning proprietary and open‑weight families, DisasterLex beats four state‑of‑the‑art baselines (LightRAG, HippoRAG 2, ReFoRCE, CHESS) by 1.4x to 2.75x, with absolute scores of 1.65 to 3.56 (of 5.0). Error analysis shows baseline failures cluster in routing and multi‑table SQL composition, the operations our orchestration explicitly addresses. Code, data, and the EKG artifact are available at https://github.com/YimingXiao98/DisasterLex and on Zenodo at https://doi.org/10.5281/zenodo.20388029.

Authors:Amirhossein Ghaffari, Saeid Sheikhi, Ekaterina Gilman
Title: Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Abstract:
Spatio‑temporal forecasting on sensor graphs is commonly tackled with a single backbone architecture applied uniformly across all nodes, although graph regions can exhibit different dynamics. Road segments differ in functional class, structure, and traffic behavior, suggesting that node‑wise expert specialization can be useful. We propose GC‑MoE, a graph‑conditioned mixture of experts framework that assigns each node a personalized combination of frozen forecasting experts based on graph topology and the recent traffic input window. GC‑MoE combines frozen pretrained spatio‑temporal GNN experts with an input‑aware, spatially contextualized router while training only a lightweight routing module. We also study a bounded graph‑conditioned output refinement layer as an optional extension and include node‑adaptive ST‑LoRA adapters only as an ablation diagnostic. Across four standard benchmarks (PEMS04, PEMS07, METR‑LA, and PEMS‑BAY), GC‑MoE improves MAE over a zero‑parameter ensemble baseline, with competitive RMSE and MAPE, while training only ~17K parameters on top of 1.5M frozen expert weights. The implementation is available at https://github.com/Ahghaffari/gc_moe.

Authors:Ojas Nimase, Jiate Li, Yue Zhao, Yushun Dong
Title: Can Subgraph Explanations Be Weaponized to Steal Graph Neural Networks?
Abstract:
Graph Machine Learning as a Service (GMLaaS) platforms increasingly implement explainability interfaces to meet regulatory transparency requirements. However, this transparency creates exploitable vulnerabilities for model extraction attacks. We present the first model extraction attack specifically designed for graph classification under strict black‑box constraints where the attacker observes only discrete class labels and binary explanation masks (no probability scores, gradients, or confidence values). Our method (1) uses model explanation outputs to guide Monte Carlo edge sensitivity estimation toward decision boundaries, with Hoeffding concentration guarantees on estimation accuracy and (2) exploits explanation subgraphs to efficiently narrow the boundary search space. Extensive experiments on benchmark graph datasets across multiple domains demonstrate our method's superiority over comparable baselines. These findings demonstrate that such explainability interfaces create exploitable attack surfaces, informing both defensive mechanisms and policy frameworks for explainable AI mandates. The implementation code is provided in https://github.com/LabRAI/XSTEAL/.

Authors:Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang
Title: LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
Abstract:
Real‑world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long‑horizon, multi‑turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real‑world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state‑evolution patterns (e.g., counterfactual perturbation, rollback, multi‑state composition), with an average dependency span of 11.3 turns. Evaluating five state‑of‑the‑art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long‑horizon errors account for 52%‑‑69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long‑horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.

Authors:Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng
Title: Exploring Autonomous Agentic Data Engineering for Model Specialization
Abstract:
Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high‑quality domain‑specific data. Existing LLM‑based data curation methods primarily rely on human‑designed workflows, leaving it unexamined whether LLMs can autonomously execute an end‑to‑end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end‑to‑end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post‑training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT‑5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent‑driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent‑driven model specialization\footnoteCode will be released at https://github.com/zjunlp/DataAgent..

Authors:Hwa Hui Tew, Junn Yong Loo, Fang Yu Leong, Julia K. Lau, Ding Fan, Hernando Ombao, Raphaël C. -W. Phan, Chee Pin Tan, Chee-Ming Ting
Title: Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification
Abstract:
Functional Magnetic Resonance Imaging (fMRI) provides non‑invasive access to dynamic brain activity by measuring blood oxygen level‑dependent (BOLD) signals over time. However, the resource‑intensive nature of fMRI acquisition limits the availability of high‑fidelity samples required for data‑driven brain analysis models. While modern generative models can synthesize fMRI data, they often remain challenging in replicating their inherent non‑stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual‑Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi‑scale variations, and projects into the discrete cosine transform (DCT) space across brain regions and time to exploit localized energy compaction of low‑frequency dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate class‑conditioned cosine‑frequency representation. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time‑domain BOLD signals. This dual‑transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI‑based brain network classification. The code is available at https://github.com/htew0001/DSFM.git .

Authors:Mingxuan Yi, Vidal Mehra, Jing Chen, John Cartlidge
Title: Enhancing Regime Shift Detection Using Unstructured Data: A Study on the Treasury Market
Abstract:
Regime shifts in financial markets reorganise the joint dynamics of asset prices and macro variables, breaking any single‑regime calibration. They are nonetheless difficult to detect reliably because the data signal is noisy and heavily multicollinear, while the contemporaneous text that announces them is unstructured. Standard regime shift detection methods rely solely on structured time‑series data and ignore policy communications, even though these texts often signal shifts before they materialise in observed prices. We propose a text‑enhanced regime shift detection pipeline that combines large language model (LLM) reasoning over central‑bank communications with statistical validation on multivariate financial time series. The framework is detector‑agnostic: text‑proposed candidates are validated using a bootstrap likelihood‑ratio test on a vector autoregression (VAR), while data‑driven candidates from arbitrary regime detectors are ratified through a lenient LLM text check. We evaluate the framework on 2010‑2024 FOMC minutes paired with a 14‑variable U.S. Treasury and macroeconomic panel, using four interchangeable data‑driven detectors. The proposed pipeline achieves F1 = 0.82 against a verified anchor list of monetary‑policy regime shifts, with same‑day modal detection latency and consistently stronger performance than pure data‑driven baselines. The results demonstrate that combining unstructured policy text with statistical structural‑break detection improves the robustness and interpretability of regime shift identification in financial markets.

Authors:Zhenxiao Fu, Lei Jiang, Fan Chen
Title: QASM-Eval: A Dataset to Train and Evaluate LLMs on OpenQASM-3 Beyond Quantum Circuits
Abstract:
Quantum computing remains in the Noisy Intermediate‑Scale Quantum (NISQ) era, where the performance is highly constrained to noise. Addressing the limitation often requires hardware‑facing capabilities beyond gate‑sequence circuit specification, including mid‑circuit measurement and classical feedback for quantum error correction (QEC), precise timing control for dynamical decoupling (DD), and pulse‑level waveform access for calibration. OpenQASM‑3 was introduced to expose exactly these capabilities, providing a hardware‑level programming interface. However, despite the rapid progress of large language models in code generation, there is still no dataset specifically designed to train and evaluate LLMs on OpenQASM‑3 programs that involve its advanced hardware‑oriented features. To address this gap, we introduce QASM‑Eval, the first comprehensive dataset designed to train and evaluate LLMs on OpenQASM‑3. Rather than focusing on quantum algorithm design or reasoning, QASM‑Eval explicitly targets the language's hardware‑facing features. QASM‑Eval comprises an expert‑verified test set of 100 tasks and a training set of 4,000 tasks, systematically covering classical logic, timing scheduling, pulse control, and complex real‑world workflows. To automatically validate generated programs, we check syntax, quantum states and program timeline using an extended verifier. Our evaluation reveals that while state‑of‑the‑art LLMs struggle heavily in OpenQASM‑3 coding tasks, targeted fine‑tuning on QASM‑Eval yields significant gains. QASM‑Eval provides a crucial benchmark and training foundation to accelerate the development of reliable LLM assistants for hardware‑facing quantum programming in NISQ era. Data and code: https://github.com/fuzhenxiao/QASM‑Eval

Authors:Chen Henry Wu, Aditi Raghunathan
Title: Self-Trained Verification for Training- and Test-Time Self-Improvement
Abstract:
Self‑improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification‑refinement (V‑R) loops; and at training time, through self‑training methods. Both are gated by the same bottleneck: the verifier. V‑R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self‑training fails similarly when bad self‑generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self‑generated errors, lacks training signal. To address this challenge, we propose self‑trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V‑R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta‑verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V‑R loop ‑ a procedure we call verifier‑in‑the‑loop training (ViL). Starting from an RL‑converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar‑forum.github.io/stv‑webpage

Authors:Xin Wang, Linxin Xiao, Yang Yao, Wenwu Zhu
Title: OOD-GraphLLM: Graph Large Language Model for Out-of-Distribution Generalized Drug Synergy Prediction
Abstract:
Drug synergy prediction (DSP) aims to identify efficacious drug combinations under various cellular contexts with different targets. However, the continual emergence of novel compounds results in variations in molecular scaffolds and sizes, causing drug synergy data to exhibit out‑of‑distribution (O.O.D.) shifts with respect to topological structure. Existing works rely on in‑distribution (I.D.) assumption, failing to handle the O.O.D. shifts. To solve this problem, we study out‑of‑distribution generalized drug synergy prediction through a graph large language model for the first time. Nevertheless, O.O.D. generalized DSP is highly non‑trivial, posing several challenges: i) how to discover structurally relevant and irrelevant molecular representations with respect to cell targets; ii) how to find the optimal graph neural architectures that accurately calculate molecular representations; and iii) how to jointly leverage molecular structural and semantic information in LLMs. To address these challenges, we propose OOD‑GraphLLM, a novel graphLLM framework which is able to accurately predict drug synergy under O.O.D. settings via jointly optimizing molecular graph representation and biomedical semantic language representations in a unified manner. Furthermore, we finetune DrugSyn‑LLM, a biomedical LLM, and employ a retrieval‑augmented biomedical instruction tuning strategy to align molecular topological information and molecular semantic information with language‑based reasoning for O.O.D. generalized DSP. Both the source code (https://github.com/EkkoXiao/Bio‑GraphLLM) and released model (https://mn.cs.tsinghua.edu.cn/bio‑graphllm/) are publicly available, where users are allowed to download model resources and interactively use the system through a web interface.

Authors:Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng
Title: When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
Abstract:
Long‑horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task‑irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed‑world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn‑level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief‑tracking prompts provide limited gains. In contrast, reinforcement learning with belief‑state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief‑state dynamics behind these failures, and representation‑level steering reduces failure rates by 46.1% across two tasks\footnoteCode is coming soon at https://github.com/zjunlp/CBM.

Authors:Travis Lelle
Title: Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection
Abstract:
We show that LoRA adapters, the dominant distribution format for fine‑tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt‑injection classifier, a small fraction of poisoned examples drives a clean‑accuracy‑preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base‑model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi‑seed adapter cohort. A behavioral detector built from two probe‑battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight‑level statistic, the cross‑module standard deviation of dimension‑normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid‑to‑late layers, with down_proj as the strongest single‑projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight‑level detector is calibration‑bound to the base model. The attack scales monotonically with rank, and the chosen trigger‑anchor token is both trigger‑dependent and base‑model‑dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.

Authors:Gijs van Nieuwkoop, Siamak Mehrkanoon
Title: Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression
Abstract:
Deep‑learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error, which can lead to overly smooth forecasts and poor representation of heavy rainfall. This study investigates whether the predictive performance of an established deterministic nowcasting architecture can be improved by reformulating training as a multi‑quantile regression problem. Using SmaAt‑UNet as a core model, we compare MSE, MAE, and multi‑quantile pinball‑loss training on radar precipitation nowcasting over the Netherlands. The results show that multi‑quantile training improves the central deterministic forecast, decreasing test‑set MSE by 8.6% compared to a model trained using MSE, while also producing upper‑quantile outputs that are useful for risk‑sensitive prediction of heavy precipitation. These findings suggest that quantile regression provides a simple alternative to standard pointwise losses without requiring a new architecture or generative sampling procedure. The implementation of our models and training setup is available on \hrefhttps://github.com/gijsvn/Multi‑Quantile‑Precipitation‑NowcastingGitHub.

Authors:Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang, Xiangyu Fan, Lei Yang, Dahua Lin, Xianglong Liu
Title: SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation
Abstract:
Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few‑step video diffusion models. However, DMD‑style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse‑KL‑style matching can be mode‑seeking and conservative for preserving strong motion dynamics. To address these issues, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake‑score perspective by directly optimizing the fake score toward the teacher, while using teacher stop‑gradient Fisher as a stable distribution‑matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative‑residual (NR) for outer‑loop correction and residual‑contraction (RC) for inner‑loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately ~ 3× training speedup and substantially improves motion dynamics for 4‑step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.

Authors:Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan
Title: Conformal Certification of Reasoning Trace Prefixes
Abstract:
Language model reasoning traces are rarely all‑or‑nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier‑agnostic calibration procedure for clean‑prefix certification. Given any step‑level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process‑labeled reasoning datasets, we demonstrate that standard step‑level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over‑ and under‑withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.

Authors:Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye
Title: Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models
Abstract:
Diffusion models generate highly realistic images but often struggle with precise text‑image alignment. While recent post‑training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward‑free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text‑image representation alignment, outperforming standard parameter‑efficient fine‑tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over‑counting and repetition. To address this issue, we propose a lightweight, reward‑free post‑training method that refines soft tokens by integrating contrastive alignment guidance directly into the score‑matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL‑based diffusion post‑training methods. Project page: https://jaayeon.github.io/AGSM

Authors:Víctor Gallego
Title: Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas
Abstract:
We study two‑level autoresearch for cooperation: an outer‑loop AI agent autonomously redesigns the inner‑loop pipeline of an LLM policy‑synthesis system for multi‑agent Sequential Social Dilemmas (SSDs). A researcher agent \mathcalR (run as a coding agent) reads the inner‑loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy‑synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand‑designed baselines, sharply tightens run‑to‑run variance, and outperforms prompt‑only optimization. The discovered pipelines are objective‑dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective‑agnostic system prompt and from every efficiency‑optimized pipeline. This supports an information‑design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch‑social‑dilemmas.

Authors:Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr
Title: Genetically Aligned Patient Representations Improve Hematological Diagnosis
Abstract:
Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single‑cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two‑stage approach: (i) self‑supervised, vision‑only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide‑level histopathology foundation models. Additionally, the model provides off‑the‑shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology‑specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.

Authors:Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian
Title: OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation
Abstract:
Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype‑centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in‑distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow‑level skills. To improve out‑of‑distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state‑of‑the‑art micro‑averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB‑NL, a highly challenging large‑scale and high‑dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek‑V3.2‑Thinking by 4.53%. After skill learning on Nano‑CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.

Authors:Leyi Qi, Yiming Li, Siyuan Liang, Zhengzhong Tu, Dacheng Tao
Title: Cert-LAS: Toward Certified Model Ownership Verification for Text-to-Image Diffusion Models via Layer-Adaptive Smoothing
Abstract:
Large‑scale text‑to‑image (T2I) diffusion models have enabled unprecedented creative applications, but their unauthorized use has raised serious intellectual property concerns, making model ownership verification (MOV) increasingly critical. We find that existing backdoor‑based diffusion watermarking methods often (implicitly) assume a "faithful" verification process, namely, that the verifier can query a suspicious model and obtain the faithful watermark response to complete MOV. However, in practice, adversaries may intentionally or unintentionally damage potential watermark signals, significantly degrading verification reliability. To address this issue, we propose Cert‑LAS, the first certified MOV method for T2I models based on layer‑adaptive smoothing. In general, Cert‑LAS embeds specified watermarks using diffusion classifiers and an LFS‑guided layer‑adaptive noise, and verifies ownership by examining whether the suspected model exhibits significantly stronger watermark responses compared to unwatermarked references through hypothesis testing. We further prove that, under certain conditions, our Cert‑LAS can still achieve reliable verification even in the presence of malicious removal attacks. Extensive experiments validate the effectiveness of Cert‑LAS and its resistance to adaptive attacks. Our code is available at https://github.com/Leyi‑Qi/Cert‑LAS.

Authors:Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su
Title: SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
Abstract:
Agentic search enables LLMs to solve complex multi‑hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self‑awareness leads to severe over‑search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self‑awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search‑disabled and search‑enabled rollouts; (ii) a boundary‑aware reward module, which translates this boundary awareness into trajectory‑level penalties, suppressing unnecessary and redundant searches; and (iii) a stage‑wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over‑search, while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.

Authors:NamGyu Jung, Chang Choi
Title: Learning Context-Conditioned Predicate Semantics via Prototype Feedback
Abstract:
In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect image‑specific evidence, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns context‑conditioned predicate semantics via prototype feedback. AlignG infers context‑conditioned predicate semantics from the relation candidates within each image and feeds the adapted semantics back to recalibrate relation representations. The learning objective anchors this adaptation to global semantic centers, preventing semantic drift while still allowing selective reorganization when the scene provides consistent relational cues. Experiments on VG‑150 and GQA‑200 show consistent improvements over state‑of‑the‑art baselines, with F@100 improvements of +1.4 on VG‑150 and +2.7 on GQA‑200 under SGDet. We further visualize per‑image prototype similarity shifts and observe coherent context‑dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence. The code is available at https://github.com/Namgyu97/AlignG‑SGG.pytorch.

Authors:Runang He, Tongya Zheng, Huiling Peng, Yuanyu Wan, Bingde Hu, Jiawei Chen, Canghong Jin, Mingli Song, Can Wang
Title: Temporal Motif-aware Graph Test-time Adaptation for OOD Blockchain Anomaly Detection
Abstract:
Ever‑evolving transaction patterns have significantly hindered anomaly detection on emerging cryptocurrency blockchains due to the vast number of addresses and diverse anomalous behaviors. Recently, advanced Graph Anomaly Detection (GAD) approaches applied to blockchains have faced two critical challenges: adversarial pattern evolution by malicious actors and the out‑of‑distribution (OOD) problem caused by varied transaction semantics on blockchains. To address these challenges, we propose a novel framework termed TEmporal Motif‑aware Graph Test‑Time Adaptation (TEMG‑TTA). First, we comprehensively capture the 3‑node temporal motif distribution of each active address using an efficient computational mechanism, enabling downstream temporal motif‑aware graph learning. Second, we design a simple yet effective test‑time adaptation strategy to facilitate the sharing of common patterns between training and testing graphs. Extensive experiments on 5 real‑world datasets demonstrate that our proposed TEMG‑TTA outperforms state‑of‑the‑art GAD approaches by an average of 54.88%. A further case study on interpretable motif patterns reveals that TEMG‑TTA explicitly characterizes the complex transaction patterns of anomalous addresses, thereby verifying the effectiveness of our technical designs. Our code will be made publicly available https://github.com/LuoXishuang0712/TEMG‑TTA/.

Authors:Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen, Jiaqi Huang, Dongyang Xu, Yizhi Wang
Title: On-Policy Replay for Continual Supervised Fine-Tuning
Abstract:
Continual supervised fine‑tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on‑policy signals ‑‑ training on the model's own outputs ‑‑ reduce forgetting more reliably than off‑policy supervision. Existing on‑policy methods route this signal through a new training objective (e.g., self‑distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher.We instead route the on‑policy signal through the training data source. Our method, On‑Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on‑the‑fly distillation. Across three 7‑‑8B instruction‑tuned backbones (Qwen2.5‑7B‑Instruct, Qwen3‑8B, Llama3.1‑8B‑Instruct) on the TRACE continual‑learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5‑7B‑Instruct, Sequential SFT BWT ‑13.93), OPR lifts BWT to ‑0.65 at a 10% replay budget and to ‑2.29 at a 1% budget ‑‑ a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42‑‑46% reductions observed across all three backbones. We give a KL‑shrinkage interpretation that places OPR and prior on‑policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low‑score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on‑policy distribution, not the response quality alone.Our code is available at https://github.com/Yancey2024/OnPolicyReplay.

Authors:Rohan Shravan
Title: Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models
Abstract:
Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic byte‑level character‑position factorization that replaces this table with a fixed encoder and a single learned projection, compatible with standard BPE tokenizers, eliminating 91‑‑94% of input‑side trainable parameters at frontier scale. We provide five contributions. First, a cross‑model probe across six LMs (135M‑671B parameters) shows trained input embeddings cluster typographic variants of the probe word far more than morphological relatives; Kronecker escapes this clustering at the embedding layer. Second, a controlled three‑seed comparison on nanoGPT GPT‑2 124M over 2.5B tokens of FineWeb‑Edu shows Kronecker reaching 2.5 +‑ 0.2% lower validation loss than the BPE‑tied baseline (gap 0.083 +‑ 0.007 nats, ~9% lower perplexity), needing ~1.43x fewer steps to reach BPE's converged loss. Third, a spelling‑robustness probe over 110 clean/typo pairs shows Kronecker preserves the top‑1 prediction on 55.5% of pairs vs. 47.3% for BPE (+8.2 pp) and lowers KL by 7.6%, winning or tying in 10 of 11 categories; a generation probe shows Kronecker echoes byte‑novel strings and typos through generation where BPE forgets them. Fourth, BPE embedding norm drifts during training while Kronecker projection norm stays near 1.0, consistent with a stable representational target. Fifth, an on‑the‑fly runtime variant reconstructs embeddings from a 4.5 MB byte buffer rather than a 2.15 GB table at vocabulary 131,072, with 0.01‑‑0.24% step‑time overhead. Byte‑level locality has a tradeoff: byte‑similar but semantically distant pairs (compute/commute, nation/notion) cluster together, shifting disambiguation to early attention layers.

Authors:Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic
Title: GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models
Abstract:
Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre‑training, these approaches introduce bias through training‑‑inference mismatch by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self‑Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage‑guided self‑teacher, derived from the closed‑form optimum of reverse‑KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization‑free objective, which reduces RL to likelihood‑free self‑distillation and thus bypasses the TIM biases. Recent ELBO‑based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA‑8B and Dream‑7B, GDSD consistently outperforms prior state‑of‑the‑art ELBO‑based methods with a more stable training reward dynamics, achieving test‑accuracy improvements of up to +19.6%. These results suggest that direct denoiser self‑distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.

Authors:Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani
Title: TRACER: Persistent Regularization for Robust Multimodal Finetuning
Abstract:
Mainstream strategies for finetuning pretrained multimodal models often degrade out‑of‑distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed‑form solutions and a geometric decomposition for each strategy. This framework shows that self‑distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias‑free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate TRACER (Trajectory‑Robust Anchoring for Contrastive Encoder Regularization), which combines contrastive learning with WMA‑guided multi‑perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).

Authors:Rohan Shravan
Title: BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base
Abstract:
We present BrahmicTokenizer‑131K, a 131,072‑vocabulary byte‑level BPE tokenizer that closes the Brahmic compression gap at the 131K‑vocabulary class while preserving the English, EU‑language, and code compression of OpenAI's o200k_base. We construct it through a two‑stage retrofit: (1) a script‑prune crop that reduces 200,019 tokens to 131,072 by removing nine out‑of‑scope writing systems, and (2) a surgical retrofit of 2,372 corpus‑dead vocabulary slots determined by linear‑programming allocation across nine Brahmic Unicode blocks. The pre‑tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer‑131K a drop‑in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer‑131K produces 26.7% fewer tokens than Mistral‑Nemo Tekken / Sarvam‑m at the same vocabulary budget, with per‑language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam‑m containing zero Oriya‑block tokens; our surgery added 725. On non‑Indic content, BrahmicTokenizer‑131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam‑m by 4.0‑14.2% on HumanEval, MBPP, and GSM8K. Across our 14‑tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam‑30B, Sarvam‑1, MUTANT‑Indic) achieve better Indic compression at the cost of non‑Indic performance: Sarvam‑1's English fertility is 15.9% worse and its code/math compression 26‑33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer‑131K.

Authors:Kyuil Sim, Sanghyeok Choi, Jinkyoo Park
Title: Solving Integer Linear Programming with Parallel Tempering
Abstract:
Integer Linear Programming (ILP) serves as a versatile framework for modeling a wide range of combinatorial optimization problems, typically addressed by sophisticated exact solvers or heuristics. While learning‑based approaches have recently shown their effectiveness, they suffer from poor generalization to out‑of‑distribution instances and inherent dependence on external solvers. In this work, we propose a solver‑free, sampling‑based optimization framework for ILP that directly explores discrete feasible regions without training or external solvers. Exploiting the linear structure of ILP, we employ a Locally‑Balanced Proposal to construct a transition kernel, thereby avoiding the gradient approximation. To overcome the highly multimodal nature of ILP energy landscapes, we integrate Parallel Tempering. In addition to standard temperature tempering, we introduce penalty tempering, which modulates constraint barriers while preserving the objective landscape over feasible solutions. Empirically, our method consistently outperforms SCIP across all four benchmarks, matches or exceeds Gurobi on two of four tasks within a 200‑second budget, and is substantially more robust to distribution shift than learning‑based methods. Furthermore, on MIPLIB 2017 instances, our framework remains competitive with classical solvers without any problem‑specific tuning.

Authors:Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao
Title: PassNet: Scaling Large Language Models for Graph Compiler Pass Generation
Abstract:
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long‑tail workloads ‑‑ our profiling shows that 43% of real‑world subgraphs experience end‑to‑end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation ‑‑ where LLMs author structured graph transformations that integrate directly into compiler pipelines ‑‑ is the more appropriate abstraction. We propose PassNet, the first large‑scale ecosystem for LLM‑based compiler pass generation, comprising: (1) PassNet‑Dataset, over 18K unique computational graphs from 100K real‑world models; and (2) PassBench, 200 curated long‑tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error‑aware Speedup Score (ES_t) ‑‑ a metric unifying correctness, stability, and performance ‑‑ with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler ‑‑ indicating that the bottleneck is consistency, not capability. Fine‑tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier‑model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM‑driven compiler optimization. All data, benchmarks, and tooling are publicly available.

Authors:Arunkumar Ramachandran
Title: SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction
Abstract:
Alarm fatigue in intensive care units (ICUs) is a well documented patient safety crisis. Clinical monitors generate 350 or more alarms per patient per day, out of which 72‑99% are clinically irrelevant. Staff desensitization to non‑actionable alarms increases the risk of missed true emergencies. This paper presents SigmaMedStat, a machine learning system that evaluates the trustworthiness of physiological alarm signals before clinical action is taken. Four approaches were evaluated on the PhysioNet/Computing in Cardiology Challenge 2015 dataset of 498 four‑channel ICU alarm recordings. Primary contribution is a temporal modeling framework that splits each 60 second recording into six consecutive 10‑second chunks, and this in turn generates Continuous Wavelet Transform (CWT) scalograms per chunk, encodes each chunk with a shared EfficientNet‑B0 encoder, and passes the resulting feature sequence to a two‑layer Long Short‑Term Memory (LSTM) network. Five‑fold stratified cross‑validation yields a mean AUC of 0.822 +/‑ 0.016 (95% CI: [0.790,0.853]), compared to 0.641 for a static EfficientNet baseline trained on the full 60‑second window. Ablation studies confirm that temporal chunking and multi‑channel signal fusion both contribute independently to classification performance. Per‑alarm type analysis reveals that Ventricular Flutter is the most accurately classified alarm type (AUC 0.820) while Asystole remains the hardest (AUC 0.722). Error analysis identifies 65 false negatives and 85 high‑confidence misclassifications as the primary failure modes. All code and results are publicly available at https://github.com/Arun‑K‑Ram/sigmamedstat.

Authors:Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He
Title: OISD: On-Policy Internal Self-Distillation of Language Models
Abstract:
Recent reinforcement learning (RL) post‑training approaches primarily optimize the final output policy using sparse outcome‑level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on‑policy internal self‑distillation and propose the OISD framework, which improves reasoning by transferring on‑policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high‑level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage‑weighted Jensen‑‑Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE‑MALT‑LAB/OISD

Authors:Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang, Neil Zhenqiang Gong, Tianlong Chen, Dawn Song
Title: Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening
Abstract:
LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real‑world LLM‑based applications are largely unexplored. In this work, we present the first systematic study of prompt‑injection attacks in a widely used application: LLM‑based resume screening. Our analysis is based on approximately 200K real‑world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small‑scale dataset demonstrates that our detectors achieve high precision and outperform state‑of‑the‑art general‑purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real‑world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large‑scale prompt injection in real‑world LLM‑based applications and lay the groundwork for future studies to understand and mitigate such attacks.

Authors:Nicolas Gillis, Subhayan Saha, Stefano Sicilia, Arnaud Vandaele
Title: Manifold-based Algorithms for the Hadamard Decomposition
Abstract:
Given a matrix X, and two ranks r_1 and r_2, the Hadamard decomposition (HD) looks for two low‑rank matrices, X_1 of rank r_1 and X_2 of rank r_2, both of the same size as X, such that X\approx X_1\circ X_2, where \circ is the Hadamard (element‑wise) product. In most cases, HD is more expressive than standard low‑rank approximations such as the truncated singular value decomposition (TSVD), as it can represent higher‑rank matrices with the same number of parameters; this is because the rank of X_1 \circ X_2 is generically equal to r_1 r_2. In this paper, we first present some theoretical insights for HD, in particular a useful reformulation X\approx WH^\top where W and H have r_1 r_2 columns and belong to certain manifolds. These allow us to develop three new algorithms for computing HD. The first one uses the representation X\approx X_1\circ X_2 and relies on the Manopt toolbox. The other two rely on the reformulation X\approx WH^\top: one is a block projected gradient method, and the other is a manifold‑based gradient descent algorithm that does not require projection onto the feasible set. The last two algorithms are particularly effective for handling large sparse data. We also propose new initializations that allow us to improve the accuracy of the HD. We compare our algorithms and initialization strategies with the TSVD and with the state of the art. Numerical results show that the new methods are efficient and competitive on both synthetic and real data.

Authors:Venkat Akhil Lakkapragada
Title: CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models
Abstract:
Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We present CosmicFish‑HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high‑level and low‑level reasoning cycles and learns when to halt based on input complexity. CosmicFish‑HRM combines this adaptive reasoning core with modern transformer components including Grouped Query Attention, RoPE, and SwiGLU activations. While the additional reasoning infrastructure introduces overhead at small scale, we hypothesize that this tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Our results show that the model learns non‑uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.

Authors:Clement Etienam, Juntao Yang, Oleg Ovcharenko, Nick Luiken, Tsubasa Onishi, Nefeli Moridis, Issam Said
Title: Sequential Physics-Constrained Neural Operator Forward Modeling for the $\textit{Norne}$ Reservoir System
Abstract:
We develop a comprehensive mathematical and computational framework for sequential surrogate modeling of three‑phase black‑oil reservoir dynamics using neural operators, with particular emphasis on Fourier Neural Operators (FNO) and their physics‑informed variant (PINO). The application focus is the Norne benchmark reservoir, defined on a heterogeneous 46×112×22 grid (N=113,344 cells), with a production history spanning T=30 timesteps covering 3298 days. Our theoretical contributions are organized around four interlocking problems: (1) functional‑analytic formulation in a product‑Sobolev‑space setting, including well‑posedness of the implicit timestep map and sharp local Lipschitz estimates; (2) covariate shift quantification, proving that the Wasserstein‑2 distance grows as W_2 \leq \varepsilon(L^n‑1)/(L‑1), with exponential population‑risk discrepancy for L>1; (3) physics‑constrained spectral stability, showing PINO training with λ_R \geq λ^_R reduces the learned Jacobian spectral radius to ρ_F + Cλ_R^‑1/2, yielding uniform‑in‑time rollout error |δ_n| \leq \varepsilon/(1‑ρ); and (4) K‑step TBPTT gradient analysis, deriving geometric bias decay O(ρ^K), optimal window K^ = O(\log(T/σ^2)), and Adam convergence O(1/\sqrtt) + O(ρ^K^). Empirical validation confirms all theoretical predictions: autoregressive PINO surrogates sustain R^2>0.99 (oil), R^2>0.90 (gas), R^2\approx 0.80 (pressure), and monotonically improving R^2 (water) across the full 3298‑day horizon, trained on eight NVIDIA B200 GPUs in under one hour. A 1000‑member ensemble runs in under one minute on a single B200 GPU, giving a ~10^4× wall‑clock speedup over the OPM finite‑volume simulator.

Authors:Sicong Wang, Ruiting Dong, Yue Liu, Bowen Zheng, Jun Meng, Jie Li, Shuaijun Guo, Yu Gu, Fanyi Di, Xin Li
Title: Generative Spatiotemporal Intent Sequence Recommendation via Implicit Reasoning in Amap
Abstract:
Real‑world user behavior rarely consists of isolated actions; instead, it often forms intent flows governed by spatiotemporal dependencies. To provide integrated service recommendations, we focus on the task of Generative Spatiotemporal Intent Sequence Recommendation (GSISR), which aims to generate intent sequences that are logically coherent and physically executable within complex spatiotemporal contexts. While LLMs offer strong reasoning potential for GSISR, direct industrial deployment is limited by high inference latency and context‑mismatched or physically infeasible plans. To address these challenges, we propose a generative framework, GPlan, that internalizes LLM reasoning into lightweight models through two components. First, to enable reasoning under strict latency constraints, we introduce Progressive Implicit CoT Distillation, which compresses explicit reasoning processes into reserved latent tokens, allowing small models to inherit complex planning logic without generating long reasoning text. Second, to address the disconnect between general knowledge and real‑world constraints, we design Spatiotemporal Counterfactual DPO. By aligning the model with counterfactual context‑plan pairs, we improve sensitivity to spatiotemporal context and reduce context‑mismatched plans. Offline experiments and online A/B testing demonstrate that our approach improves sequence coherence and context responsiveness. Our implementation and the anonymized GSISR dataset are available at https://github.com/alibaba/GPlan.

Authors:Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary
Title: Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?
Abstract:
Fine‑tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine‑tuning (SFT), attributing this to policy‑gradient updates remaining closer to the base policy \citeshenfeld2025rl. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head‑level measure of how much a circuit degrades under fine‑tuning, and use it to compare RL and SFT on Qwen2.5‑3B‑Instruct adapted to scientific question‑answering. We find a clear mechanistic trade‑off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl‑sft‑circuit‑research/differential‑circuit‑vulnerability.

Authors:Weicheng Xue
Title: Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents
Abstract:
We study behavioral alignment and representation dynamics of large language model (LLM) agents in financial decision environments. TradeArena, an auditable trading‑agent testbed with risk reports, execution simulation, memory, and replayable trajectories, lets us analyze how rationales, positions, and interventions evolve under market stress. Code and data artifacts are available through the \hrefhttps://github.com/weich97/TradeArena.gitTradeArena repository. We find pre‑failure signatures: planning embeddings drift from normal centroids, fused plan‑risk representations separate normal from pre‑drawdown states, and local manifolds exhibit effective‑rank contraction. Across 80 rolling failure anchors and eight LLM trajectories, this pattern persists across hash, LSA, Transformer, and white‑box hidden‑state probes. Stress tests with CoT‑free target weights, lexical controls, OHLCV noise, and false audits show that rationale‑level contraction can vanish without rationales, while intent‑space and fused signatures remain informative. Structured risk feedback can act as an external alignment signal without fine‑tuning, but not as a universal performance enhancer: true audit feedback improves calibration for some models, returns for others, and exposes cases where placebo or hidden feedback has higher short‑horizon return but weaker alignment diagnostics. A 51‑stock intraday experiment reveals a correlation blind spot: LLM rationales justify exposure to coupled assets that the risk layer clips. Finally, a financial‑audit task suite shifts comparison from ``which model trades best'' to whether models can audit trajectories, respect execution boundaries, reproduce artifacts, and avoid claim overreach. These results support a research claim, not a profitability claim: auditable risk feedback and representation trajectories reveal when LLM financial reasoning is aligning, drifting, or failing.

Authors:Tirtharaj Dash
Title: BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks
Abstract:
Tabular data in knowledge‑rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse‑exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2‑literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most 2/d of the weights in each BIR layer are active, where d is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to 96× fewer active parameters than an architecture‑matched dense MLP. First‑layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage‑defining co‑expression modules, and immune‑infiltration markers. Data and code are available at: https://github.com/MAHI‑Group/BIRDNet.

Authors:Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang
Title: MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems
Abstract:
Memory is essential for enabling large language models to support long‑horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine‑grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long‑Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation‑level issues like information loss and retrieval misalignment. Crucially, we leverage these fine‑grained attribution signals to guide downstream prompt optimization, establishing a closed‑loop system that automatically corrects faults and boosts end‑task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

Authors:Krishnam Gupta
Title: How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures
Abstract:
We discover that VLA architectures fail in fundamentally different, predictable ways at the motor‑command level. Running VQ‑BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14‑DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete‑token architectures, following a discrete‑to‑continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non‑predictive everywhere (AUROC 0.41‑0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous‑family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture‑matched monitor selection is essential. These results quantify a monitoring consequence of the well‑known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture‑matched selection is required. This finding was enabled by SafeContract, a training‑free, black‑box action monitoring toolkit with conformal calibration. Code: https://github.com/krishnam94/vla‑edge

Authors:Haonan Wen, Hanyang Chen, Songhe Feng
Title: Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration
Abstract:
Irregular multivariate time series forecasting is critical in many real‑world applications, where time series are irregularly sampled and exhibit dynamically evolving missingness patterns. Although existing methods perform well in offline settings, they often suffer from significant performance degradation when deployed online due to dynamic shifts in data distribution. Maintaining forecasting capability in such dynamic scenarios typically necessitates online adaptation techniques. Since irregular sampling fundamentally undermines temporal continuity and periodicity, we cannot leverage these widely studied characteristics from regular MTS for online learning. To this end, we study the problem of online IMTS forecasting and propose Under‑Cali, an uncertainty‑driven dual‑expert calibration framework consisting of three core components: an uncertainty estimator, a dual‑expert calibration module, and an adaptive routing module. We design an uncertainty estimator that serves as the core control signal to jointly manage inference and adaptation processes. In our framework, the uncertainty estimator first assesses uncertainty for each incoming batch. The adaptive routing module then directs samples with high uncertainty to the unreliable expert for calibration, while low uncertainty samples remain with the reliable expert. Subsequently, the system updates the reliable expert and the uncertainty estimator using well‑calibrated reliable samples, and updates the unreliable expert with challenging samples, enabling stable and efficient online learning. Under‑Cali keeps the source forecasting model frozen and performs adaptation only through a lightweight, model‑agnostic calibration module, enabling efficient adaptation. Extensive experiments on IMTS benchmarks demonstrate consistent improvements with low computational cost. Our code is available at https://github.com/HaonanWen/Under‑Cali.

Authors:Satoshi Tanaka, Takahiro Nishimichi, Yosuke Kobayashi
Title: Dark Quest II: A Wide-Coverage Neural Network Emulator of the Nonlinear Matter Power Spectrum Across Extended Cosmologies
Abstract:
\textscDarkEmulator2 is a neural network emulator of the nonlinear matter power spectrum in a nine‑dimensional w_0 w_a νo \mathrmCDM parameter space, developed as the emulator component of the \textscDark Quest II (DQ2) program. It is trained on simulations generated with the \textscGinkaku code, whose numerical implementation, accuracy tests, and post‑processing pipeline are described in the companion paper. The design follows a unified strategy: in addition to the cosmological parameter vector, we supplement the neural network's inputs with three families of physically motivated auxiliary quantities ‑‑ the linear matter power spectrum, descriptors of the simulation resolution, and a low‑dimensional summary of the initial Gaussian random field ‑‑ that are expected to improve generalization across the parameter space. Training a single network jointly across three simulation resolution tiers allows the emulator to exploit a small number of high‑resolution simulations while retaining broad coverage from lower‑resolution simulations. For a L_\mathrmbox=1\,\hiGpc box with N=3000^3 particles, the emulator reproduces the simulated matter power spectrum to subpercent accuracy up to the particle Nyquist scale, k_\mathrmNy~eq 10\,\hMpci. The emulator remains accurate over the calibrated wavenumber range, while its highest‑k predictions depend on the simulation resolution and shot noise. We validate the emulator on independent test suites and, through a cross‑comparison with several public emulators and widely used fitting formulas, characterize the inter‑model consistency and the parameter‑dependent trends in their residuals.

Authors:José Lucas De Melo Costa, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan
Title: High Performance, Low Reliability: Uncertainty Benchmarking for Tabular Foundation Models
Abstract:
Recent Tabular Foundation Models (TFMs) have demonstrated state‑of‑the‑art predictive performance, often surpassing Gradient‑Boosted Decision Trees (GBDTs). However, the trustworthiness of these models, particularly their uncertainty quantification, has been largely overlooked. We investigate this gap through an extensive study comparing TFMs, GBDTs, and classical baselines on the 112 datasets of the TALENT benchmark. Our results reveal a performance‑uncertainty trade‑off: although TFMs achieve the highest predictive performance, measured by AUC, they exhibit lower conditional coverage under conformal prediction, measured by SSCS, compared to GBDTs. Complementary experiments on synthetic datasets further characterize the regimes in which this effect intensifies. We conclude that while TFMs advance predictive frontiers, achieving well‑calibrated uncertainty remains a major open challenge for their reliable adoption. Code is available at: https://github.com/jose‑melo/high‑performance‑low‑reliability

Authors:Alan Ferrari
Title: Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference
Abstract:
Standard transformer architectures apply a single attention mechanism uniformly across all tokens and sequence positions, irrespective of local context or computational budget. We propose Meta‑Attention, a framework that dynamically routes each token to the most appropriate attention strategy ‑‑ full softmax attention, linear (kernel) attention, or sliding‑window local attention ‑‑ via a Bayesian Meta‑Controller. Unlike prior routing approaches that use deterministic or prior‑free learned routing, the Meta‑Controller treats per‑token mechanism selection as posterior inference under a compute‑aware Dirichlet prior: routing weights are the output of an amortised variational posterior q(alpha | x_t; phi) trained with an Evidence Lower Bound (ELBO) objective that jointly encodes task performance and attention‑mechanism cost. This design produces principled routing uncertainty estimates that govern the soft‑to‑hard routing transition, mitigates routing collapse without ad hoc load‑balancing losses, and yields better compute‑performance trade‑offs than deterministic or prior‑free learned routing at negligible overhead. Phase 1 empirical results on a Tiny LM benchmark confirm core predictions: the Bayesian controller's learned routing distribution implies a projected normalised FLOP cost of 25.1% under hard routing, vs. 59.3% for the prior‑free baseline (‑34.2 pp), and reduces routing entropy from 55.8% to 43.3% (‑12.5 pp), demonstrating that the Dirichlet prior prevents routing collapse while the non‑Bayesian model defaults to full attention. We present the Bayesian architecture, ELBO training objective, and a Phase 1 PyTorch prototype validating forward‑pass correctness, posterior diversity, and a controlled ablation against a prior‑free baseline. Code available at: https://github.com/KFEAL/meta‑attention

Authors:Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang
Title: ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
Abstract:
Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short‑term acceptance and long‑term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path‑level rewards decompose into step‑level rewards with positive mean, creating a length‑dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path‑level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length‑dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position‑Specific Advantage Estimation leverages the reward decomposition structure to compute step‑dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real‑world datasets demonstrate that ProRL significantly outperforms state‑of‑the‑art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.

Authors:Evgenii Palnikov, Elizaveta Gavrilova
Title: Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation
Abstract:
We study quality‑latency‑resource trade‑offs in a documentation‑grounded retrieval‑augmented generation (RAG) system that uses Low‑Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question‑answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid‑retrieval pipeline (BGE‑M3 dense, BGE‑M3 native sparse, Reciprocal Rank Fusion, cross‑encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama‑3.2‑3B‑Instruct and Llama‑3.1‑8B‑Instruct across rank and target‑module choices, and evaluate each on token‑level F1, LLM‑judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines operating regime. A param‑matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at https://github.com/EugPal/rag‑lora‑tradeoffs.

Authors:Lei Zhang, Fubo Sun, Haipeng Yang, Zhong Guan, Likang Wu
Title: Robust Contrastive Graph Clustering with Adaptive Local-Global Integration
Abstract:
Graph clustering is essential in graph analysis for revealing structural patterns and node communities. Despite recent advances in self‑supervised contrastive learning that have improved clustering via structural and attribute signals, existing methods still struggle to flexibly capture high‑order local structures and often overlook global semantics in complex graphs. These limitations lead to suboptimal node representations, especially in real‑world graphs with fragmented structures and ambiguous cluster boundaries. To address these limitations, a contrastive graph clustering framework is proposed to jointly integrate multi‑scale local structures with global semantics via attention mechanisms. At the local level, GNN‑based topological signals extracted from multiple propagation depths are adaptively fused through attention‑based weighting to capture multi‑scale neighborhood features. At the global level, semantic prototypes derived from dynamically evolving cluster centers are adaptively aggregated through attention to guide node representations and enhance inter‑cluster separability. The model is trained under a dual‑view contrastive learning paradigm with a hybrid objective that combines instance‑level and structure‑aware losses to improve representation robustness and discrimination. Experiments on eight real‑world graph datasets demonstrate that our method achieves competitive clustering performance. Code is available at https://github.com/vege12138/w2.

Authors:Junghoon Lim
Title: QuITE: Query-Based Irregular Time Series Embedding
Abstract:
Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven Multivariate Time Series (MTS) models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input‑embedding‑based approach. We identify that the key bottleneck lies not in the backbone architecture, but in conventional embedding layers that assume uniform sampling. In this work, we introduce QuITE (Query‑Based Irregular Time Series Embedding), a simple yet effective plug‑and‑play embedding module for IMTS. QuITE employs learnable query tokens to aggregate irregular observations through a single self‑attention layer, directly producing backbone‑compatible latent representations without artificial value generation or architectural modification. Extensive experiments on real‑world benchmarks show that QuITE consistently improves MTS models, yielding average relative gains of up to 54.7% in forecasting and 15.8% in classification across diverse datasets and backbone architectures. Code is available at: https://github.com/Meaningfull9502/QuITE.

Authors:Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang
Title: Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization
Abstract:
Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration‑exploitation trade‑off, resulting in unstable optimization and sub‑optimal performance. We introduce IB‑Score, a novel metric grounded in Information Bottleneck theory that evaluates policy's exploration‑exploitation balance by quantifying the trade‑off between step‑level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB‑Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain balance during training with suboptimal results. To address this, we propose Information Bottleneck‑driven Tree‑based Policy Optimization (IB‑TPO), a principled framework that formulates IB‑Score as a fine‑grained optimization objective and utilizes a novel IB‑guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB‑Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms GRPO baseline by 2.9% to 3.6% and also outperforms other state‑of‑the‑art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.

Authors:Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang
Title: SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection
Abstract:
Fine‑tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine‑tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety‑Projected Alternating optimization with Relevance‑Diversity aware data selection. SPARD employs SPAG, which optimizes alternatively between utility updates and explicit safety projections with a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance‑Diversity Determinantal Point Process to select compact safe data, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine‑tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state‑of‑the‑art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.

Authors:Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu
Title: ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains
Abstract:
On‑policy self‑distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token‑level supervision for on‑policy rollouts. However, existing OPSD methods often yield limited gains on in‑domain reasoning and generalize poorly to out‑of‑domain problems. We identify two key causes: conditioning the self‑teacher on a verified solution encourages imitation of training‑domain reference trajectories rather than error‑specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On‑policy Self‑Distillation (ROSD), a framework that turns reference‑solution imitation into targeted reasoning correction through reflection‑guided, error‑localized distillation. For each rollout, ROSD uses a self‑reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self‑teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in‑domain and out‑of‑domain reasoning benchmarks show that ROSD yields stronger in‑domain reasoning performance overall and substantially better out‑of‑domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.

Authors:Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim
Title: Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping
Abstract:
Diffusion posterior sampling conditions diffusion priors on measurements, but data‑consistency updates are typically scaled by hand‑tuned guidance weights and can destabilize sampling under stiff, operator‑dependent curvature. We replace scalar guidance with a per‑noise‑level damped Gauss‑‑Newton correction computed in diffusion‑state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one‑sided curvature model that avoids forward denoiser Jacobians, and applies diffusion‑calibrated rank‑one damping aligned with the denoiser residual. Each correction is solved with matrix‑free GMRES using automatic differentiation, and sampling proceeds with a variance‑preserving Langevin transition with a closed‑form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.

Authors:Junlin Wang
Title: Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal
Abstract:
Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high‑frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to directly imitate these raw trajectories inevitably causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion‑based policies, where iterative denoising steps can inadvertently amplify high‑frequency artifacts at the expense of meaningful fine‑grained details. To address these limitations, we present a novel frequency‑based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, Frequency Guidance Operator (FGO), steers the generation process of diffusion polices by progressively driving the noisy samples through intermediate sub‑frequency manifolds with expanding spectral bands. Validated on 15 robotic manipulation tasks from 5 benchmarks, FGO achieves superior performance in enhancing action smoothness and temporal consistency while preserving the details necessary for successful task execution. Project website: https://henrywjl.github.io/frequency‑guidance‑operator/

Authors:Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro
Title: DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification
Abstract:
Claim verification splits between end‑to‑end classifiers that are accurate but yields no inspectable traces, and decomposition‑based methods produce inspectable traces but lag performance on benchmark datasets. We propose DecomposeRL an accurate claim‑verifier that produce inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi‑faceted reward ensemble, enabling both fully supervised and semi‑supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data‑curation funnel that distills 115K fact‑verification claims into a compact, learning‑signal‑dense subset of 5K claims. We show that a DecomposeRL‑7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in‑domain and 69.8 out‑of‑domain balanced accuracy across 11 claim‑verification benchmarks containing biomedical, political, scientific, and general‑domain claims. Despite being 4x smaller, it matches 32B baselines and GPT‑4.1‑mini, and it further outperforms baselines in a semi‑supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL

Authors:Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani
Title: Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox
Abstract:
Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language‑implied (incorrect) answers. To understand the cause of this gap, we perform layer‑wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder‑‑LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt‑Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language‑implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, improving Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox, and from 37.74% to 54.78% on MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.

Authors:Liu Zhang, Amit Singer
Title: Robust Moment-Based Estimation via Spectral Gradient Reweighting
Abstract:
Moment‑based estimation is a theoretically attractive approach to parametric inference, especially when likelihood‑based estimation is unavailable, misspecified, or computationally inconvenient. However, the moment equations involve sample averages, which makes moment‑based estimation sensitive to outliers. We propose the SGR‑GMM algorithm, a robust generalized method of moments (GMM) procedure that uses a spectral gradient reweighting (SGR) primitive to soft‑reweight the per‑observation gradients during the moment‑matching optimization. Our analysis has three layers. First, for a fixed center, the SGR primitive is formulated as an entropy‑regularized spectral game between a sample‑weight player and a density‑matrix player, which is analyzed using classical multiplicative‑weights and matrix‑multiplicative‑weights regret bounds. Second, we establish explicit convergence radius and finite termination bound for the fixed‑center updates in the SGR primitive. Third, we prove a local finite‑sample parameter estimation error bound with explicit dependence on the contamination fraction, inlier gradient stability, local GMM identification strength, and optimization accuracy. We further specialize the SGR‑GMM algorithm to obtain a robust diagonally‑weighted GMM (DGMM) estimator for estimating heteroscedastic low‑rank Gaussian mixtures observed under additive Gaussian noise and strong contamination. In the numerical experiments, the SGR primitive produces nearly‑oracle gradient estimation and the robust DGMM specialization substantially improves over non‑robust moment baselines. The code and data are available at https://github.com/liu‑lzhang/sgr‑gmm.

Authors:Khang Tran, Yazan Boshmaf, Issa Khalil, NhatHai Phan, Ting Yu, Md Rizwan Parvez
Title: Poison with Style: A Practical Poisoning Attack on Code Large Language Models
Abstract:
Code Large Language Models (CLLMs) serve as the core of modern code agents, enabling developers to automate complex software development tasks. In this paper, we present Poison‑with‑Style (PwS), a practical and stealthy model poisoning attack targeting CLLMs. Unlike prior attacks that assume an active adversary capable of directly embedding explicit triggers (e.g., specific words) into developers' prompts during inference, PwS leverages developers' code styles as covert triggers implicitly embedded within their prompts. PwS introduces a novel data collection method and a two‑step training strategy to fine‑tune CLLMs, causing them to generate vulnerable code when prompts contain trigger code styles while maintaining normal behavior on other prompts. Experimental results on Python code completion tasks show that PwS is robust against state‑of‑the‑art defenses and achieves high attack success rates across diverse vulnerabilities, while maintaining strong performance on standard code completion benchmarks. For example, PwS‑poisoned models generate CWE‑20 vulnerable code in 95% of cases when the trigger code style is used, with less than a 5% drop in pass@1 performance on the HumanEval and MBPP benchmarks. Our implementation and dataset are here: https://github.com/khangtran2020/pws.

Authors:Tim R. Davidson, Anja Surina, Caglar Gulcehre
Title: The Future of Facts: Tracing the Factual Generation-Verification Gap
Abstract:
Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation‑verification gap (GV‑gap) underlies many recent advances in self‑improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV‑gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open‑source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi‑verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well‑covered facts.

Authors:Aurelio Amerio
Title: GenSBI: Generative Methods for Simulation-Based Inference in JAX
Abstract:
Flow and diffusion generative models have established themselves as widely adopted density estimators for simulation‑based inference (SBI), extending naturally from neural posterior estimation to likelihood and joint density estimation. Their principled optimization objectives and freedom from architectural constraints have driven rapid adoption across the natural sciences. Yet the most widely used SBI libraries remain PyTorch‑based, leaving researchers who develop their forward models and analysis pipelines in JAX without a native option. We present GenSBI, an open‑source library that implements flow matching, score matching, and denoising diffusion entirely in JAX. The library offers three transformer‑based architectures ‑ SimFormer, Flux1, and a novel Flux1Joint that extends gate‑modulated transformer blocks to joint density estimation ‑ all interchangeable through a unified interface that decouples generative method, neural backbone, and inference mode. GenSBI provides an end‑to‑end workflow from training through posterior calibration (SBC, TARP, LC2ST) and supports custom architectures with domain‑specific embedding networks. We validate the framework on standard SBI benchmarks, achieving near‑ideal mean C2ST scores (0.50‑0.56, where 0.50 is ideal) on SBIBM tasks with minimal per‑task tuning and well‑calibrated posterior coverage across all tested configurations. The code is publicly available at https://github.com/aurelio‑amerio/GenSBI.

Authors:Syed Huma Shah
Title: Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?
Abstract:
Modern retrieval‑augmented generation(RAG) deployments increasingly rely on caching to reduce token cost and time‑to‑first‑token(TTFT). Prefix‑level KV reuse is now standard in serving stacks such as vLLM, and chunk‑level and position‑independent reuse have been pushed further by recent systems(RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output‑level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence‑validated cache router that admits a cached answer only when 4 cheap gates simultaneously hold: query similarity, retrieved‑evidence overlap, source‑version validity, and lexical (or judge‑based) support of the cached answer by the freshly retrieved evidence. We build a six‑regime workload that stress‑tests cache safety rather than only hit rate, and introduce an operator‑facing metric, the unsafe‑served rate (USR), fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real‑LLM generations(Qwen2.5‑7B‑Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime(vs. 15‑35% under naive caching) and to 1.5% on mtRAG document drift(vs. 51.5%), a 34x reduction on the design‑point adversarial regime and 3‑10x reductions across the other mtRAG regimes, while end‑to‑end p50 latency stays within 1.04‑1.07x of a no‑cache RAG baseline. A per‑gate ablation isolates the lexical support gate as the load‑bearing safety mechanism on both datasets, with the remaining gates providing defense‑in‑depth at near‑zero cost. We release the implementation, workload, and evaluation harness.

Authors:Hyunmin Cho, Woo Kyoung Han, Kyong Hwan Jin
Title: Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective
Abstract:
We characterize the pre‑softmax attention matrix \mathbfQK^\top in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew‑symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew‑symmetric component as driving circulation on that landscape. Leveraging the energy formulation induced by the symmetric component, we derive Hopfield‑style stability measures that quantify the stability of retrieved features. We observe meaningful correlations between Hopfield‑style stability measures and the fidelity‑diversity trade‑offs in generation. Finally, we propose a controllable knob to modulate this trade‑off by modifying the circulation of the underlying dynamics. Code is available at our GitHub (https://github.com/hyeon‑cho/Attention‑Symmetric‑Decomposition).

Authors:Nicole Koenigstein
Title: AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems
Abstract:
Multi‑agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one‑off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open‑source framework that treats multi‑agent coordination as an online policy‑learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed‑systems incident tasks and security‑advisory tasks. The evaluation shows three main results: learned routing reaches a higher‑quality operating point than a fixed pipeline baseline on coordination‑heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm‑started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination‑heavy multi‑agent workflows over static wiring.

Authors:Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang
Title: FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation
Abstract:
We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene‑level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint‑based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self‑supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi‑class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero‑shot and long‑tail scenarios, underscoring its potential for scalable, label‑free 3D object segmentation.

Authors:Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao
Title: ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference
Abstract:
Fine‑grained Mixture‑of‑Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory‑constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine‑tuning framework designed to boost token‑wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short‑horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference‑time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real‑system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU‑CPU expert offloading and reducing TPOT by 43.6‑49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77‑1.99× decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA‑OSCAR/ReMoE.

Authors:Hsiu-Yuan Huang, Weijie Liu, Chenming Tang, Sanwoo Lee, Kai Yang, Yangkun Chen, Saiyong Yang, Yunfang Wu
Title: RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data
Abstract:
The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic‑source Tracing via Lineage‑Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage‑aware perspective. To this end, we propose Source‑level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per‑atomic‑source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held‑out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data is available at https://github.com/Celine‑hxy/ATLAS.

Authors:Yali Fink, Ido Ben-Yair, Lars Ruthotto, Eran Treister
Title: RAPNet: Accelerating Algebraic Multigrid with Learned Sparse Corrections
Abstract:
The scalable solution of large sparse linear systems is a bottleneck in scientific computing and graph analysis. While algebraic multigrid (AMG) offers optimal linear scaling, its performance is severely constrained by the trade‑off between the sparsity and convergence quality of coarse‑grid operators. Classical AMG heuristics struggle to balance these objectives, often sacrificing stability or performance for sparsity. We propose RAPNet, a graph neural network (GNN) framework that resolves this trade‑off by learning to generate sparse, robust coarse operators directly from the sparse algebraic system. Key to our approach is a level‑wise training strategy that enables learning from small subgraphs and generalization to million‑node domains, bypassing the bottlenecks of prior neural AMG attempts. RAPNet executes exclusively during the solver setup phase, ensuring that the solve phase retains its favorable computational properties. We show that our method outperforms classical non‑Galerkin baselines on diverse PDE discretizations and graph Laplacians, making it particularly effective for multi‑query tasks such as eigenproblems, time‑dependent simulations, and inverse or design problems.

Authors:Gwangho Kim, Sungyoon Lee
Title: Localizing Memorized Regions in Diffusion Models via Coordinate-Wise Curvature Differences
Abstract:
Diffusion models can unintentionally memorize training samples, raising concerns about privacy and copyright. While recent methods can detect memorization, they often rely on global or model‑specific signals and provide limited insight into where memorization appears within a generated image. We provide a geometric characterization of local memorization as a coordinate‑wise variance collapse. However, such collapse can also arise from intrinsic data constraints rather than overfitting. To isolate overfitting‑driven memorization, we propose curvature‑difference methods that subtract the curvature of an underfitted baseline, either the unconditional model or a less‑trained version of itself. We further derive a score‑difference proxy that provides a geometric explanation for the widely used score‑difference‑based detection metric. Experiments on Stable Diffusion, evaluated against ground‑truth memorization masks, show that our method outperforms the prior attention‑based localization method. Code is available at https://github.com/Gwangho99/mem‑curv‑diff.

Authors:Ashima Khanna, Dominik Grimm
Title: Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets
Abstract:
Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off‑policy generative approaches often degrade under surrogate noise, and position‑agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory‑level self‑improvement imitation framework for oracle‑budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active‑learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB‑based proxy ensemble, combined with an alanine‑scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next‑action cross‑entropy imitation on the round's best oracle‑labeled trajectories, avoiding value‑function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top‑100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early‑stage improvement. In low‑data and noisy‑proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git

Authors:Shuang Liang, Chaochuan Hou, Xu Yao, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang
Title: Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting
Abstract:
While previous research in multivariate time series forecasting has focused on developing complex holistic models, this work advocates for a shift toward a granular, component‑level understanding of their impacts. We propose TSCOMP, the first large‑scale benchmark that systematically deconstructs deep forecasting methods into their core, fine‑grained components‑‑spanning series preprocessing, encoding strategies, network architectures including specific and large time‑series models, and optimization methods. Using constrained orthogonal experimental design and extensive evaluations, we conduct multi‑view analyses that reveal component effectiveness across different backbones, data characteristics, and their interactions. Beyond providing insights, this benchmark establishes a fine‑grained performance corpus comprising over 20,000 model‑dataset evaluations, which supports the learning of automated component selection, enabling zero‑shot model construction on new datasets. Our experiments demonstrate that the corpus‑driven approach, despite its simplicity, consistently outperforms state‑of‑the‑art methods, validating the soundness of our evaluation design and confirming that systematic component selection surpasses manually designed complex architectures. All code and the performance corpus are publicly available at https://github.com/SUFE‑AILAB/TSCOMP.

Authors:Jaewoo Lee, Hyeongyu Kang, Dohyun Kim, Kyuil Sim, Woocheol Shin, Minsu Kim, Taeyoung Yun, Jeongjae Lee, Sanghyeok Choi, Tabitha Edith Lee, Jong Chul Ye, Jinkyoo Park
Title: Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference
Abstract:
Aligning a few‑step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few‑step Generative Models Alignment via Sample‑based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward‑tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample‑based variational inference scheme and amortize its particle updates into the generator parameters via fixed‑point regression. We evaluate FAV on two domains: robotics manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline‑to‑online RL tasks. For image generator alignment, FAV fine‑tunes diverse few‑step backbones, including GAN, drifting model, consistency models, and flow maps, scaling from ImageNet‑256 to 1024^2 text‑to‑image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.

Authors:Jiahe Huang, Sihan Xu, Sharvaree Vadgama, Rose Yu
Title: Recursive Flow Matching
Abstract:
Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed‑fidelity trade‑off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self‑consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics‑based tasks. To our knowledge, this is the first method to achieve high‑fidelity one‑ and few‑step (2‑4 step) dynamic generation for scientific systems with performance comparable to state‑of‑the‑art multi‑step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20× speedup over leading diffusion‑based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real‑time scientific emulation.

Authors:Jiawei Tang, Xinyan Du, Hui Liu, Junhui Hou, Yuheng Jia
Title: Variational Inference for Evidential Deep Learning
Abstract:
While Deep Neural Networks (DNNs) achieve remarkable performance, their tendency to produce overconfident predictions. Evidential Deep Learning (EDL) mitigates this by formulating predictions as a Dirichlet distribution over class probabilities to explicitly quantify epistemic uncertainty. However, we found that the conventional EDL suffers from two fundamental limitations: a Kullback‑Leibler (KL) penalty that only suppresses the evidence of negative classes, producing excessively high evidence therefore decreasing the model's ability to quantify uncertainty, and an absence in theoretical guarantee of setting Dirichlet parameter α=e+1. In this paper, we propose a mathematically principled framework, Variational Inference Evidential Deep Learning (VI‑EDL). By reformulating evidential learning through the lens of variational inference, we derive an Evidence Lower Bound (ELBO), which prevents the evidence from growing excessively. Theoretically, we rigorously establish a generalization bound and reveal how the predicted uncertainty, feature and network complexity affect this bound, and why setting \boldsymbolα = \mathbfe + \mathbf1 can minimize it. Extensive experiments on standard visual and medical datasets demonstrate that VI‑EDL achieves state‑of‑the‑art performance, showing excellent performance in out‑of‑distribution detection, noise detection and autonomous driving scenario. The code is available in https://github.com/seutjw/VI‑EDL.

Authors:Dhruv S. Kushwaha, Zoleikha A. Biron
Title: Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning
Abstract:
Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model‑free RL is limited by the need for accurate dynamics and hand‑designed barrier certificates. We propose Robust Koopman‑CBF SAC, a safety‑filtered actor‑‑critic framework that learns a finite‑dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic‑program safety layer. To account for finite‑dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held‑out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman‑CBF feasible set, reducing dependence on the filter over training. Across safe‑control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high‑dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first‑order velocity barriers and linear EDMD models, motivating high‑order and multi‑step Koopman‑CBF extensions. These results suggest that robust Koopman‑CBF filters are a promising bridge between model‑free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective. All code is available at \hrefhttps://github.com/DhruvKushwaha/Koopman‑CBF‑Soft‑Actor‑CriticGithub Repository.

Authors:Joohwan Ko, Justin Domke
Title: Amortized Factor Inference Networks for Posterior Inference
Abstract:
Amortized inference promises fast test‑time Bayesian inference, but existing methods are inherently tied to fixed models. Extending amortization to unseen models typically requires retraining or costly test‑time finetuning. In this paper, we ask: is it possible to build a single inference network capable of generalizing across varying priors, likelihoods, and dimensionality? We introduce Amortized Factor Inference Networks (AFINs), a family of encode‑merge‑decode inference networks built on dimension‑independent modules that map a model specification and its observations to the parameters of a variational posterior. Experimentally, a single trained AFIN achieves posterior accuracy comparable to NUTS and several variational inference methods, while requiring 2 to 4 orders of magnitude less test‑time compute. Code is available at https://github.com/joohwanko/AFINs.

Authors:Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro
Title: BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma
Abstract:
Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor‑related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision‑language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor‑related factors, limiting both accuracy and biological interpretability. We present BioFact‑MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On a HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image‑report pairs), BioFact‑MoE consistently improves survival prediction over all baselines across time horizons, achieving 12‑, 18‑, and 24‑month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype‑aware risk stratification. Pathway‑informed gating uncovers clinically meaningful treatment‑associated survival heterogeneity. In held‑out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy‑639/BioFact‑MoE.

Authors:Xinpeng Wang, William X. Cao, Andrew Gordon Wilson, Zhe Zeng
Title: Automatic Layer Selection for Hallucination Detection
Abstract:
Recent studies on hallucination detection have shown that hallucination‑related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high‑performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identify optimal or near‑optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training‑free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination‑related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic‑Layer‑Selection‑for‑Hallucination‑Detection.git

Authors:Athanasios Zeris
Title: Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention
Abstract:
Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end‑to‑end without explicit frequency decomposition) and scale‑selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy‑Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian‑windowed wavelets that adapt the joint position‑frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1‑3 baseline); MoPE alone is ‑0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 ‑‑ more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale‑initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (<=6M parameters, character‑level benchmarks, single seed); larger‑scale multi‑seed validation is the most important direction for future work.

Authors:Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky
Title: Personalized Generative Models for Contextual Debiasing
Abstract:
Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text‑to‑image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.

Authors:Chenghao Qiu, Chunli Peng, Yufeng Yang, Kuan-Hao Huang, Yi Zhou
Title: When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning
Abstract:
In‑context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input‑output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness‑utility gap, we introduce task‑preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label‑updating perturbations, where task‑relevant semantics change and targets are recomputed, and stricter target‑preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task‑preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task‑preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at https://github.com/Chenghao‑Qiu/Task‑Preserving‑ICL.

Authors:Sandeep Kumar, Virginia Smith, Chhavi Yadav
Title: Curriculum Learning for Safety Alignment
Abstract:
Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out‑of‑distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO‑based safety alignment. We propose Staged‑Competence, a curriculum‑based framework that organises preference data by difficulty, employs competence‑based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged‑Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near‑zero over‑refusal. We further show that Staged‑Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged‑Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum‑learning‑for‑safety.

Authors:Michael Fuchs, Dominik Kreiss
Title: Beyond Differences: Doubly Robust Meta-Learners for Ratio-Based Treatment Effects
Abstract:
When treatment effects are naturally expressed as ratios ‑‑ as in medicine, pricing, and marketing ‑‑ the ratio‑based CATE τ(x) = E[Y|W=1,X=x] / E[Y|W=0,X=x] is the appropriate estimand. Yet existing estimators either impose a log‑linear parametric structure or apply generic regression without robustness guarantees for this functional. We introduce the Q‑Learner, which decomposes τ(x) into a product of two odds ratios, reducing ratio‑CATE estimation for binary outcomes to two propensity classification tasks. We further derive doubly robust augmentations for both S/T‑ and Q‑style ratio learners and characterize their distinct robustness properties. In benchmarks on seven RCT datasets, the Q‑Learner is the most consistently competitive method in low‑conversion regimes, where its propensity‑only construction sidesteps the imbalanced regression that hurts outcome‑based estimators. On four observational datasets, where propensity must be estimated and confounding cannot be ruled out, the DR learners introduced here decisively come out on top, making them practitioners' natural default for confounded observational data.

Authors:Guanghui Wang, Kaiwen Lv Kacuila, Zhiyong Yang, Zitai Wang, Jin-Wen Wu, Longtao Huang, Qianqian Xu, Qingming Huang
Title: The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works
Abstract:
Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or the teacher's full next‑token distribution (soft labels). Despite soft labels appear strictly richer, we find that mixing hard and soft labels consistently yields better results. Crucially, we show that this gain cannot be explained by closer teacher matching during training. Instead, it comes from reduced exposure bias, the mismatch between training and inference distributions. To explain this phenomenon, we introduce the Bridge‑Garden Decomposition theory, which categorizes generation steps into two types: Bridges, where the next token must be exact, and Gardens, where it can be flexible. We show that hard‑only KD excels in Bridges by avoiding risky deviations, while soft‑only KD preserves diversity in Gardens. A hybrid strategy handles both cases and, as a result, reduces exposure bias across the sequence. Guided by this theory, we develop a family of Bridge‑Garden hybrid supervision methods that adaptively balance hard and soft labels. Across a primary suite of seven teacher‑student pairs (including Qwen, Llama, Gemma, and DeepSeek) and benchmarks in reasoning and coding, our approach outperforms divergence‑based and on‑policy KD baselines while reducing training cost by 9.7x, enabling efficient model compression. Code is available at https://github.com/ghwang‑s/bridge_garden_hybrid_kd_release.

Authors:Shuwen Yu, William P Marnane, Geraldine B. Boylan, Gordon Lightbody
Title: HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals
Abstract:
This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic‑ischemic encephalopathy (HIE) using the instantaneous heart rate (HR) signal. Unlike conventional approaches that rely on handcrafted features, HRVConformer directly processes raw HR signals in an end‑to‑end manner, capturing both local and long‑range dependencies through a hybrid Convolution‑Transformer framework. By integrating convolutional layers for local feature extraction and Transformer‑based attention mechanisms for global context modelling, the architecture effectively enhances signal representation and classification performance. The model was trained using supervised learning on a large HR dataset consisting of 1,573 one‑hour epochs, including 259 one‑hour expert‑annotated epochs and a substantial set of weakly labelled data. A 314‑hour validation set provided a robust performance estimation, while an independent 215‑hour dataset with expert annotations was reserved for final testing. HR signals were extracted from electrocardiogram (ECG) recordings using an improved Pan‑Tompkins algorithm, which significantly enhanced both signal quality and data availability. Experimental results demonstrate that the HRVConformer achieves an AUC of 83.23% and accuracy of 74.56% on the test set. These results surpass the performance of the Transformer, ResNet50 and fully convolutional networks baselines, highlighting the advantages of integrating convolutional and Transformer‑based components for HR‑based HIE classification. The proposed method provides a promising step toward a more accurate and automated assessment of HIE using HR signals. The code is available at: https://github.com/syu‑kylin/HRVConformer.

Authors:Zihang Zhou, Ziqian Ren, Yukai Wu, Yingjie Xiong, Wei Zhou, Chao Peng, Dong Zhang, Bingheng Yan, Xuanhe Zhou, Fan Wu
Title: SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?
Abstract:
Functionality‑correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository‑specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification‑strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross‑repository experience transfer, (2) multi‑step trial‑and‑repair under non‑invertible state changes, and (3) robust verification of setup outcomes to distinguish setup‑induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning‑based setup framework. First, we construct a Self‑Evolving Experience Representation (XPU), a dual‑modality knowledge unit encoding setup signals, textual guidance, executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience‑Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known‑good states. Third, we introduce a Prosecutor‑Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build‑time metrics. Evaluation results on carefully‑crafted benchmarks show SetupX achieves highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi‑repository setup requiring coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.

Authors:Ke Li, Dong An, Xiaoling Zang, Can Ye, Liang Xie, Qibo Qiu, Chen Shen, Xiaofei He, Wenxiao Wang
Title: InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization
Abstract:
Low‑bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low‑bit uniform quantizer. Existing post‑training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer‑facing distribution design and analyze quantization error from an information‑theoretic perspective. Our analysis shows that quantization‑friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a train‑free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization‑friendly distributions. We further introduce adaptive outlier‑token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end‑to‑end training baselines. Under W4A4KV4, it preserves 97% of floating‑point accuracy on average and reduces the LLaMA‑2 13B performance gap by 42% over the previous state of the art. Code is available at [https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant)

Authors:Zejia Qi
Title: LearnedCache: An eBPF-Integrated Perceptron-Based Eviction Policy for the Linux Page Cache
Abstract:
Linux is the foundation of the digital age, accounting for the majority of the cloud and mobile OS markets. Any device that runs Linux uses the Linux page cache, a central pillar in OS and application performance, serving to reduce extraneous disk access. Many page cache eviction policies have been developed but remain bound by the rigidity of heuristics. The rise of AI‑driven tools in recent years, melded with the ever‑increasing variety of workloads for Linux devices, sets the stage for machine‑learning‑driven cache eviction policies. Promising research has been done in this field, but only in the field of user‑space applications such as CDNs. We develop LearnedCache, an eBPF‑integrated single‑layer perceptron‑based cache eviction policy for the Linux page cache, trained on real kernel data from diverse workloads. We demonstrate median AUCs of nearly 80% over multiple linear models modeling page reuse time, then take a step further by embedding these models inside the Linux kernel for real‑time performance evaluation. Through statistical testing over 50 paired trials against a baseline of FIFO for each workload, LearnedCache reveals that machine‑learning‑derived cache eviction policies are practical in the Linux kernel under representative empirical workloads and are able to surpass conventional FIFO by statistically significant margins of up to 10% in insertion rate, a frequency‑adjusted derivation of cache hit rate, in specific workloads while incurring minimal overhead.

Authors:Hanzala Afzaal, Danish Memon, Chouhdary Bilal Raza, Muhammad Khurram Shahzad
Title: Enhancing Autonomous Online Intrusion Detection for IoT with Balanced Learning, Reliable Pseudo-Labels, and Lightweight Architectures
Abstract:
The rapid proliferation of Internet of Things (IoT) devices has created an urgent demand for adaptive, resource‑efficient Intrusion Detection Systems (IDS) capable of handling dynamic and evolving cyber threats. This paper investigates AOC‑IDS, a state‑of‑the‑art autonomous online IDS published at IEEE INFOCOM 2024, which employs an Autoencoder (AE) with Cluster Repelling Contrastive (CRC) loss and an autonomous Gaussian‑based decision module. We first successfully replicate AOC‑IDS on the UNSW‑NB15 benchmark, achieving 89.39% accuracy in close agreement with the published 89.19%. We then identify four key limitations: class imbalance, unreliable pseudo‑label generation, limited generalization, and computational overhead for IoT deployment, and propose targeted improvements for each. Our XGBoost‑BalSamp method achieves 95.45% accuracy on UNSW‑NB15, a gain of 6.26% over the baseline. Our combined deep learning approach (PseudoFilter, MixupAug, and LiteAE) achieves a best‑run accuracy of 90.88% (F1: 91.45%), surpassing the base paper while reducing model parameters by 55%.These results demonstrate that targeted improvements to AOC‑IDS yield consistent accuracy gains while improving practical deployability on IoT edge devices.

Authors:Xindi Tong, Chee Wei Tan, H. Vincent Poor
Title: Adversarial Water-Filling: Theory, Algorithms and Foundation Model
Abstract:
Competitive resource allocation problems over frequency and space can be formulated as minimax interaction between transmit power and worst‑case interference. This formulation naturally arises in multi‑operator low Earth orbit (LEO) satellite spectrum sharing, where transmissions from competing constellations interfere in real‑time. Under Gaussian channels, AWF is strongly convex‑‑concave on nondegenerate active channels, whereas discrete constellations yield generally nonconvex mercury/water‑filling formulations. In this paper we propose the Adversarial Water‑Filling (AWF) problem with corresponding theory and algorithms for these real situations. In addition, we develop a wireless foundation model for AWF to learn the AWF search dynamics. The architecture incorporates permutation‑invariant channel representations, a constraint‑aware graph neural network (GNN) with sparse message passing, and global latent variables capturing the low‑dimensional water level implied by the AWF optimality. Through learned projected extragradient iterations, the model approximates stationary solutions of the constrained minimax problem arising under mercury/water‑filling. We further show that, under local regularity and contractivity conditions, the learned AWF dynamics converge locally linearly around regular stationary points. Experiments demonstrate empirical generalization across unseen problem sizes, different constraints, and multiple discrete constellations, while achieving more than one‑order‑of‑magnitude runtime improvements over iterative baselines. The related code can be found at https://github.com/convexsoft/AWF.

Authors:Dongxu Yang
Title: Device Context Protocol: A Compact, Safety-First Architecture for LLM-Driven Control of Constrained Devices
Abstract:
Large language models are increasingly used as orchestrators of external tools via the Model Context Protocol (MCP), but MCP is built for software services with megabytes of memory and does not descend to the microcontrollers that dominate the long tail of physical devices. Recent work (IoT‑MCP) ports MCP to edge gateways at 74 KB peak memory; this still excludes the smallest commodity MCUs and, critically, does not address the safety problem of giving an unreliable caller (an LLM that may hallucinate or be prompt‑injected) direct control of physical hardware. We present the Device Context Protocol (DCP): a sub‑50‑byte typical frame (6‑byte header + CBOR payload + optional 16‑byte HMAC), a manifest schema in which capability scoping, range and type checks, dry‑run evaluation, and units‑as‑types are protocol‑layer primitives, and a host‑side Bridge that rejects malformed or hallucinated calls before any byte reaches the device. Reference firmware measures 27.6 KB flash / 0.6 KB RAM on ESP32; the Python Bridge, ESP32 firmware, and a language‑neutral conformance suite are MIT‑licensed and public. An empirical study ‑‑ 675 tool calls produced by five LLMs across four vendors (DeepSeek, Alibaba, Zhipu, MiniMax) against six categories of adversarial prompts, with the injection category instantiating AgentDojo's attack templates ‑‑ shows DCP rejects 100% of capability‑escalation attempts and 78% of prompt‑injection attempts, versus 0‑‑1% for Raw MCP and IoT‑MCP, matching the expressiveness of a well‑formed OpenAPI 3 schema at three orders of magnitude less firmware footprint. We position DCP as the missing layer between MCP (which is moving toward enterprise SaaS connectivity) and the physical devices it does not reach.

Authors:Tongxi Wu, Jian Zhang, Yang Gao
Title: Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
Abstract:
Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near‑binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi‑metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic diagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet decreased internal safety activation, a decoupling phenomenon that explains why detection‑based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene‑anchored prompts without model‑specific optimization. Furina outperforms strong single‑turn and multi‑turn baselines on HarmBench and achieves competitive results on MM‑SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code is available at: https://github.com/0xCavaliers/Furina_Jailbreak.

Authors:Xianglin Yang, Bryan Hooi, Gelei Deng, Tianwei Zhang, Jin Song Dong
Title: Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
Abstract:
The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black‑box adversarial framework that learns semantics‑preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI‑reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1‑2 points on a 9‑point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style‑control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM‑as‑a‑judge paradigm and motivate robust, attack‑aware evaluation. Our code is available at https://github.com/xianglinyang/llm‑as‑a‑judge‑attack.

Authors:Venkatakrishnan Gopalakrishnan
Title: SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection
Abstract:
Unsupervised anomaly detection is widely used in transaction fraud detection where labels are scarce. Isolation Forest (IF) is among the most popular classical methods due to its scalability and ease of deployment. We propose SilIF, an augmentation of Isolation Forest that adds a silhouette‑based scoring layer computed in a representation space induced by the trees of the forest. For each point, we extract a vector of per‑tree path lengths, cluster these "fingerprints" into structural groups, and compute a silhouette score that measures how well the point fits its assigned group versus the nearest alternative. The silhouette signal is combined with the base IF score via a single hyperparameter alpha. On the IEEE‑CIS Fraud Detection benchmark (~590K transactions, 3.5% fraud), SilIF with alpha=1.0 improves over plain Isolation Forest by +0.0080 AUC‑PR on average across five seeds, with SilIF winning on all five seeds (paired t‑test p=0.046). We also report results on a synthetic credit‑card dataset (Sparkov) where the silhouette augmentation does not improve over plain IF, and we characterize the conditions that distinguish the two outcomes. The paper presents SilIF as a tunable, easy‑to‑deploy enhancement to Isolation Forest with honest reporting of when it helps and when it does not. Code at https://github.com/venkat15vk/silif‑anomaly‑detection.

Authors:Xu Yao, Siyuan Zhou, Zhenbo Wu, Chaochuan Hou, Shuang Liang, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang
Title: Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark
Abstract:
Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label‑scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open‑source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE‑AILAB/WSADBench.

Authors:Parth Darshan, Abhishek Divekar
Title: When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Abstract:
Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion, however they produce natural‑language critiques, not numerical vectors. Thus, the conflict‑resolution toolkit of multi‑task learning (PCGrad, MGDA) does not apply to this multi‑objective textual gradient setting. We extend TextGrad to the multi‑objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross‑objective information the loss, gradient and optimizer LLMs share. We find the gradient's task‑focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single‑objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (‑0.085). These results identify two separable failure modes: optimization‑time gradient dilution and inference‑time instruction interference, which together constrain the design space for multi‑objective judge optimization using textual feedback.

Authors:Sam Bowyer, Acyr Locatelli, Kris Cao
Title: Efficient Benchmarking Is Just Feature Selection and Multiple Regression
Abstract:
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information‑theoretic feature‑selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data‑poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman ρ and Kendall τ) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .

Authors:Santosh Kumar Radha, Oktay Goktas
Title: UWM-JEPA: Predictive World Models That Imagine in Belief Space
Abstract:
World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector‑valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM‑JEPA), a JEPA world model with a density‑matrix latent on a joint system‑environment space and a learned unitary predictor. The construction preserves the joint‑state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden‑velocity indicator task requiring five‑step forward simulation under a given action sequence with the target observation masked, UWM‑JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter‑matched LSTM‑JEPA trained under the same counterfactual‑target objective and action head collapses to majority‑class accuracy (0.53) under every action condition. Under blind rollout, UWM‑JEPA loses fewer than ten points of probe R^2 at short horizons while vector‑latent baselines lose forty‑one and sixty‑eight; both nevertheless tie on a held‑out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher‑forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context‑encoding capacity alone.

Authors:Sohaib Lafifi
Title: Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies
Abstract:
We give an attribution method for neural combinatorial‑optimisation (CO) policies that (i) decomposes a decision by constraint families via LP‑relaxation duals, (ii) certifies counterfactuals through a combinatorial feasibility model (implemented as a CSP feasibility‑decision model), and (iii) bounds the size of a PAC‑sufficient explanation with a Bonferroni‑corrected Hoeffding sufficient‑subset test along a greedy ordering. Across three CO problems and three seeds, our LP‑anchored Λ‑attribution matches the CF‑derived signal at 96.5% on CVRPTW (n_cert=344) and 77.2% on the Orienteering Problem (n_cert=281) vs 75.0% and 35.2% for proxy gradient (paired diffs +0.215 and +0.420; McNemar exact p \le 10^‑14). In the rank‑aligned regime of the Flexible Job‑Shop Scheduling Problem, both backends agree on every CSP‑certified flip (n_cert=59), confirming the no‑gain prediction. Bonferroni‑PAC subsets average 5.0 nodes per step (M=70, \varepsilon=δ=0.2, k_\max=25). Reference implementation: https://github.com/sohaibafifi/neuro‑co‑cax

Authors:Gorgi Pavlov
Title: Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization
Abstract:
We apply the influence‑adaptive Walsh geometry of a companion theory paper (arXiv:2605.01637) to extreme low‑bit weight‑only LLM quantization. The recipe is one math‑invariant transformation: WHT‑rotate each linear layer's weight matrix and rescale its columns by per‑coordinate Walsh‑basis activation energy before handing off to a reconstruction‑error quantizer (Intel auto‑round). This biases per‑group integer rounding toward high‑spectral‑energy channels. On four pretrained decoder‑only models from 135M to 1.5B parameters, BBT‑spectral reduces wikitext‑2 perplexity by 15‑58% relative to vanilla auto‑round at W2A16; we also report a TinyLlama‑1.1B auxiliary data point. Three extensions transfer the recipe to families it failed on: a per‑head PCA matrix‑Gamma replacement of q_norm/k_norm for Qwen3 attention (PPL 136.76 ‑> 88.99 on Qwen3‑0.6B); an SO(2) per‑pair rotation that commutes with RoPE (PPL 36.93 ‑> 21.84 on Qwen2.5‑1.5B); and an MoE‑aware input‑side absorption fix identified by architectural fuzzing of Laguna‑style fused‑expert layouts. A W2‑vs‑W4 ablation gives a deliberate negative control: the redistribution payoff falls within the +/‑0.5 PPL noise floor at W4, consistent with the Schur‑convexity intuition that the cost of unconcentrated influence vanishes as the noise budget shrinks. All quantized weights export to OpenVINO IR and run on Intel NPU + Arc dGPU + CPU with PPL invariant to device within +/‑0.1. We do not claim a formal Boolean‑to‑real‑valued transfer of the theory paper's majorization argument: the WHT activation energy used here is not the Boolean influence of the theory paper, the link is intuitive, and the contribution is engineering value rather than a transferred theorem. Head‑to‑head benchmarks against SpinQuant, QuaRot, QuIP‑sharp, AQLM, OmniQuant, and ButterflyQuant at matched calibration are the main future‑work item.

Authors:Ruitao Liu, Qinghao Hu, Alex Hu, Yecheng Wu, Shang Yang, Luke J. Huang, Zhuoyang Zhang, Han Cai, Song Han
Title: Hide to Guide: Learning via Semantic Masking
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning‑intensive tasks, but its effectiveness is often limited by exploration. For example, models often fail on hard problems, leaving little useful reward signal. External expert traces offer a natural source of guidance, yet they may also expose reward‑relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer‑related entities. This content can create an unintended reward hacking channel, allowing the policy to obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided‑RL methods reduce this risk by using partial trajectories, but they mainly control how much expert information is shown heuristically rather than which parts should be hidden. To this end, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine‑grained semantic masking strategy for expert‑guided RLVR. Instead of truncating traces coarsely or revealing them unchanged, SMEPO masks reward‑relevant semantic spans along the critical path while preserving the expert's decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill‑in‑the‑blank process: the policy can follow the expert's problem‑solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x. The code is available at https://github.com/mit‑han‑lab/SMEPO.

Authors:Mohamed Boussena, Florence Monville, Jacques Fieschi-Meric, Frederic Vely, Pierre Milpied, Julien Mazieres, Maurice Perol, Eric Vivier, Laurent Greillier, Fabrice Barlesi, Sebastien Benzekry
Title: Multimodality Stacking with Blockwise missing values and application to the PIONeeR biomarkers study for prediction of resistance to immunotherapy
Abstract:
Integrating multimodal datasets in clinical oncology is frequently hindered by high dimensionality and blockwise missingness, where entire data sources are unavailable for specific patient subsets. Standard survival models often struggle with these gaps, leading to biased results or patient exclusion. We introduce Multimodality Stacking with Blockwise missing values (MSB), a late‑fusion framework for survival analysis that independently models modality‑specific features before aggregating predictions via a cross‑validated stacking meta‑learner. MSB was validated on the PIONeeR study (n=443 patients, 378 biomarkers across eight heterogeneous sources) to predict progression‑free survival in advanced non‑small cell lung cancer patients receiving immunotherapy. MSB yielded higher predictive performance (C‑index) than baseline algorithms. Improvements varied by baseline strength: linear models showed a 15.9% increase (p<0.001 for the Wilcoxon signed‑rank test), random survival forests gained 5.4% (p=0.002), and gradient boosting methods improved by 2.1% (p=0.030). Beyond discrimination, MSB reduced the generalization gap (train‑test difference in 5 folds cross‑validation repeated 3 times: 0.055 vs 0.380 for linear models). Permutation importance analysis identified routine laboratory markers, clinical features, and PD‑L1 expression as primary predictive drivers. Missing block indicators showed negligible importance, suggesting the model learned from biomarker values rather than data availability patterns. MSB provides a statistically validated framework for multimodal survival prediction with blockwise missingness. By enabling systematic biomarker evaluation without requiring complete data, MSB offers a practical tool for predictive modeling in biomedical research, pending external validation. Implementation is available at https://github.com/MohamedBoussena/MSB under Inria license.

Authors:Festus Kahunla
Title: TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis
Abstract:
Applied Behavior Analysis (ABA) is a clinical discipline whose documentation, teaching programs and multi‑session behavioral logs, is formulaic and high‑volume, yet real session data is HIPAA‑protected and bound by professional confidentiality rules, blocking the release of a training corpus. We present TRACE (Taxonomy‑Referenced ABA Clinical Examples), a 2,999‑example synthetic instruction‑tuning dataset covering two ABA tasks: teaching‑program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi‑session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy‑driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance, the exact taxonomy cells that produced it. The dataset is released under CC BY‑NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.

Authors:Yichen Luo, Peiyu Zhu, Dongxiao Hu, Jia Wang, Tailin Wu, Dapeng Lan, Yu Liu, Zhibo Pang
Title: Mitigating Gradient Pathology in PINNs through Aligned Constraint
Abstract:
While Physics‑Informed Neural Networks (PINNs) are powerful for solving Partial Differential Equations (PDEs), their training is often paralyzed by gradient pathology. The gradients from the PDE residuals and boundary constraints oppose each other, trapping the model in local minima. Current solutions, such as adaptive weighting or hard constraints, either fail to fundamentally resolve this ill‑conditioning or are limited to simple geometries. In this study, we systematically analyze the possible causes of this gradient pathology from the perspectives of loss landscapes and optimization dynamics. Based on the obtained conclusion, we propose Constraint‑Aligned loss with Manifold Lifting (CAML). By reformulating all zeroth‑order terms into aligned constraints, our method effectively mitigates gradient conflicts. In addition, we introduce a delay factor to help the optimizer skip the high‑curvature area. Experiments demonstrate that our CAML significantly enhances numerical stability and efficiency in highly complex PINN problems. Our code is open‑sourced on https://github.com/YichenLuo‑0/CAML.

Authors:Ibrahim Delibasoglu
Title: Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection
Abstract:
The rapid evolution of generative models has enabled the creation of hyper‑realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out‑of‑the‑box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross‑domain evaluation comparing three foundational learning paradigms: fully supervised macro‑semantic features (RoPE‑ViT), pure self‑supervised geometric features (DINOv3), and multi‑teacher agglomerative representations (NVIDIA C‑RADIOv4‑H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade‑offs between pre‑training paradigms and parameter scale, proving that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available in http://github.com/mribrahim/deepfake

Authors:Da Zhang, Bingyu Li, Zhiyuan Zhao, Hongyuan Zhang, Junyu Gao, Xuelong Li
Title: MedMamba: Multi-View State Space Models with Adaptive Graph Learning for Medical Time Series Classification
Abstract:
Medical time series are central to healthcare, enabling continuous monitoring and supporting timely clinical decisions. Despite recent progress, existing methods struggle to jointly model local‑global dynamics and handle nonstationarities like baseline drift, while often failing to capture latent channel interactions. To address these challenges, we propose MedMamba, an end‑to‑end architecture that integrates state space models with domain‑specific inductive biases. Specifically, MedMamba first employs multi‑scale convolutional embeddings to capture discriminative local morphology. Second, to mitigate nonstationarity, we introduce a tri‑branch differential state space encoder that processes raw, temporal‑difference, and frequency‑domain views, fusing them to emphasize informative patterns while suppressing drift. Furthermore, to uncover latent channel correlations, we design a spatial graph Mamba module that learns a directed dependency structure regularized toward sparsity and acyclicity, which obviates the need for predefined graphs. Extensive experiments on five real‑world datasets demonstrate that MedMamba achieves state‑of‑the‑art performance while maintaining linear computational complexity, and ablation studies validate each component's contribution.Code is available at https://github.com/zhangda1018/MedMamba.

Authors:Ruize Li, Zhibin Wen, Tao Han, Hao Chen, Fenghua Ling, Wei Zhang, Song Guo, Lei Bai
Title: RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges
Abstract:
Accurate evaluation of weather forecasting models is critical for their reliable deployment in real‑world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real‑time operational forecasting, thereby resulting in a systematic mismatch between benchmark performance and real‑world forecasting. In this work, we introduce RealBench, a next‑generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out‑of‑distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low‑latency operational analysis and a large‑scale global in‑situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high‑impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event‑specific metrics that better reflect real‑world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis‑based metrics and real‑world performance, particularly concerning extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next‑generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize‑del/NWP‑Benchmark.

Authors:Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos
Title: HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos
Abstract:
Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity‑level representation of hand‑object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot‑data‑free, hardware‑agnostic, data‑efficient, and zero‑shot human‑to‑robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real‑world tasks (75% with just 15 minutes), outperforms matched‑time robot teleoperation by 41%, and robustly transfers zero‑shot across novel robots, cameras, and environments. We release HumanEgo as an easy‑to‑use, open‑source framework for learning robot policies directly from human data: https://github.com/TX‑Leo/HumanEgo

Authors:Ali Noshad, Zishan Zheng, Yinjun Wu
Title: MVR-cache: Optimizing Semantic Caching via Multi-Vector Retrieval and Learned Prompt Segmentation
Abstract:
To reduce LLM costs and latency, semantic caching systems must accurately identify when a new prompt matches a cached one. Current methods often rely on simplistic similarity measures, which limit their effectiveness. We introduce MVR‑cache, a novel semantic caching approach that significantly improves retrieval accuracy by integrating Multi‑Vector Retrieval (MVR). MVR‑cache is built upon a learnable segmentation model that intelligently splits prompts, enabling fine‑grained similarity comparisons via MaxSim. We derive the model's training objective from a rigorous theoretical analysis. This can ensure that optimizing this objective directly maximizes cache hits under strict correctness constraints. To solve the resulting non‑differentiable combinatorial optimization problem, we leverage a reinforcement learning‑based training strategy with the theoretically grounded objectives as the reward. Experimental results on established benchmarks across diverse tasks confirm that in comparison to the state‑of‑the‑art, MVR‑cache consistently increases the cache hit rates by up to 37% while maintaining the same correctness guarantees. MVR‑cache is available at https://github.com/PKU‑SDS‑lab/MVR‑Cache

Authors:Mini Han Wang, Liting Huang, Wei Hong, Boonthawan Wingwon
Title: Explainable Retinal Imaging for Prediction of Multi-Organ Dysfunction in Type 2 Diabetes
Abstract:
Background: Type 2 diabetes mellitus (T2DM) is increasingly recognised as a systemic disease characterised by coordinated dysfunction across metabolic, renal, lipid, and inflammatory pathways. Existing clinical assessments often fail to capture this multi‑dimensional burden. Methods: We conducted a retrospective study of 1,195 patients using routinely collected laboratory biomarkers. System‑level abnormality indices were constructed to quantify organ‑specific dysfunction, and multi‑system involvement was defined as abnormalities in two or more systems. Supervised machine learning models, including logistic regression, random forest, and gradient boosting, were trained to predict multi‑system dysregulation. Model interpretability was achieved using SHapley Additive exPlanations (SHAP). Results: The gradient boosting model demonstrated near‑perfect discrimination (AUC = 1.000), significantly outperforming logistic regression (AUC = 0.925). Feature attribution analysis revealed that hyperglycaemia, renal impairment, dyslipidaemia, and inflammation were the dominant drivers of multi‑system risk. Dose‑response relationships observed in partial dependence analyses further supported the biological plausibility of model predictions. Conclusion: This study presents an interpretable, data‑driven framework for quantifying systemic disease burden in T2DM. By linking routine biomarkers to multi‑organ dysfunction, our approach provides both predictive accuracy and mechanistic insight, offering potential for improved risk stratification and precision medicine in diabetes care. The data and code used in this study are openly available on GitHub at: https://github.com/MiniHanWang/Type‑2‑Diabetes‑1.git

Authors:James Henry
Title: The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth
Abstract:
Concept formation in transformer language models is depth‑extended, not a single‑layer event: concepts emerge gradually across a contiguous region of the residual stream. Mechanistic interpretability methods identify the single layer of peak class separation ‑‑ the "best layer" ‑‑ capturing a snapshot rather than the process itself. We introduce the Concept Allocation Zone (CAZ): the depth interval within which a concept becomes measurably separable, the region allocated to its geometric expression. We formalize the CAZ through three layer‑wise metrics (Separation, Concept Coherence, Concept Velocity) and derive principled boundary detection without manual layer sweeps. A CAZ is not a concept: it is the depth region within which the model organizes its geometry to make a concept separable. A single concept typically participates in multiple CAZes; multiple concepts may share one. Empirical validation across 34 models from 8 architectural families and 7 concepts reveals that the separation curve S(l) is frequently multimodal. A scored detector uncovers "gentle CAZes" ‑‑ subtle allocation regions invisible to standard peak detection but causally active in 93‑100% of cases under ablation (16 of 34 models; 26 in the companion validation paper). The framework generates seven testable predictions; four yield clear verdicts (two not supported, one partially supported, one supported), one had its precondition invalidated by the data, and two are underpowered ‑‑ with cross‑architecture alignment confirmed as depth‑matched rather than monolithic under leave‑one‑concept‑out cross‑validation. Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).

Authors:Zeyu Shen, Zhuoyuan Wang, Laixi Shi
Title: T2S-MPC: Time-Embedded Online Adaptive Model Predictive Control for Time-Varying Dynamics
Abstract:
Recent advances in learning‑based model predictive control (MPC) have leveraged neural networks for online model learning, achieving strong performance when nonstationary system dynamics deviate from nominal models. However, existing approaches primarily address specific or relatively structured forms of dynamical variation, leaving more general, unknown, and unpredictable time‑varying dynamics insufficiently handled. To tackle this challenge, we propose T2S‑MPC, a framework that adaptively learns a residual dynamics model online and integrates it with the nominal model within the MPC framework to enable fast‑evolving online planning. To make the model time‑aware, we explicitly encode temporal information through a structured time embedding and employ a two‑timescale update scheme, allowing the controller to capture nonstationary dynamics while balancing rapid adaptation with stable learning. We evaluate the proposed method on a 2D quadrotor across stabilization and trajectory tracking tasks under diverse time‑varying disturbances, including linear drifting and periodic perturbations. Experimental results show that T2S‑MPC consistently outperforms classical MPC, neural MPC, and ablated variants in control performance, while also demonstrating strong robustness across a wide range of disturbance conditions without additional tuning. The source code is publicly available at https://github.com/Zeyuu0920/T2S_MPC

Authors:Jinjin He, Zhiqi Li, Sinan Wang, Bo Zhu
Title: Hermite-NGP: Gradient-Augmented Hash Encoding for Learning PDEs
Abstract:
We propose Hermite‑NGP, a gradient‑augmented multi‑resolution hash encoding designed to enable fast and accurate computation of spatial derivatives for neural PDE solvers. Unlike existing NGP‑based approaches that rely on automatic differentiation or finite differences and suffer from instability or high cost, Hermite‑NGP explicitly stores function values and mixed partial derivatives at hash grid vertices, allowing fully analytic evaluation of gradients, Jacobians, and Hessians via Hermite interpolation. This design preserves the efficiency and spatial adaptivity of NGP while supporting analytic differential operators up to second order. We further introduce a multi‑resolution curriculum training strategy analogous to multigrid V‑cycles to enable coarse‑to‑fine optimization. Across a range of 2D and 3D PDE benchmarks, Hermite‑NGP achieves up to approximately 20 times lower error than prior neural PDE methods, and reduces wall‑clock convergence time by 2 to 10 times compared to other solvers, with per‑epoch training times as low as 3.5 ms for models with up to 17M parameters.

Authors:Ismail Lamaakal
Title: Motion-Compensated Weight Compression
Abstract:
Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook cross‑layer redundancy induced by function‑preserving symmetries. We propose Motion‑Compensated Weight Compression (MCWC), a weight‑only codec that aligns permutation‑symmetric blocks (e.g., hidden units and attention heads) to maximize cross‑layer correspondence, turning depth into a predictable sequence. In the aligned coordinate system, MCWC uses a lightweight layer‑sequential predictor with periodic keyframes and encodes only quantized prediction residuals using a learned entropy model trained under a rate distortion objective. A simple decoder reconstructs deployable weights by entropy decoding, dequantization, predictor‑driven reconstruction, and inverse alignment, enabling fast weight materialization for inference. Across Transformer language modeling and vision classification, MCWC improves the rate accuracy Pareto frontier over strong quantization and learned weight‑codec baselines, while maintaining competitive decode time. Ablations confirm that alignment, prediction, entropy modeling, and keyframe scheduling are each necessary for the full gains. Our code is available via https://github.com/Ism‑ail11/MCWC.

Authors:Michel A. Youssef
Title: CALIBURN: A Regime-Sensitivity Study of Operationally Calibrated Streaming Intrusion Detection
Abstract:
Streaming network intrusion detection systems must process flows continuously while keeping memory bounded, but most current methods leave alerting threshold selection as a post‑hoc tuning problem poorly suited to production. Operators need alerting behaviour specifiable before deployment using inputs such as false‑negative cost, false‑positive cost, and alerting budget. This paper presents CALIBURN, a five‑component streaming alerting pipeline composed of a truncated Bayesian online change‑point detector, an isotonic calibration layer mapping the change‑point posterior to an empirical conditional attack probability, a cost‑sensitive decision threshold derived from operator‑specified misclassification costs, a Conformal Risk Control wrapper that converts an alert‑budget specification into a within‑window valid threshold under exchangeability, and a multi‑window burn‑rate alerting layer adapted from Site Reliability Engineering practice. Rather than claiming uniform dominance, we present CALIBURN as a regime‑sensitivity study, evaluating the pipeline across three attack‑prevalence regimes: LITNET‑2020 at 5.2 percent, CICIDS2017 at 22.06 percent, and UNSW‑NB15 at 64 percent. In the rare‑attack regime, CALIBURN achieves AUC‑PR 0.943 on LITNET‑2020, outperforming the best streaming baseline by 2.21x and the best batch reference by 4.12x; isotonic calibration reduces Brier score by 30 percent. In the moderate‑prevalence regime, CALIBURN remains the strongest streaming method on CICIDS2017 but is exceeded by batch density methods. In the high‑prevalence regime, all streaming methods approach the prevalence floor. We further identify two distinct CRC‑collapse mechanisms driving the alert rule to degeneracy at small alpha, treating both as operational guidance for practitioners.

Authors:Sol Park, Soobin Um
Title: Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion
Abstract:
Minority sampling aims to generate low‑density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model‑specific notions that may poorly reflect real‑world semantics. In this work, we propose a world‑centric perspective on minority sampling, which defines rarity with respect to real‑world priors rather than generator‑induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint‑Embedding Predictive Architecture (JEPA) ‑‑ a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low‑density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real‑world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class‑conditional, and text‑to‑image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator‑centric baselines in capturing real‑world notions of rarity. Code is available at https://github.com/soobin‑um/jepa‑guidance.

Authors:Jaeung Lee, Dohyun Kim, Jaemin Jo
Title: Measuring the Depth of LLM Unlearning via Activation Patching
Abstract:
Large language model (LLM) unlearning has emerged as a crucial post‑hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output‑level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white‑box studies reveal such residual knowledge but often rely on auxiliary training or dataset‑specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0‑1 scale. In a meta‑evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white‑box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning‑depth‑score

Authors:Piotr Wilam
Title: CSP-Atlas: Concept-Specific Neural Circuits in a Sparse Python Transformer
Abstract:
A sparse 8‑layer code transformer develops dedicated neural circuitry for every Python construct tested, and that circuitry is organised by a clean computational principle rather than by semantic category. We extract neural circuits for 106 concepts (43 AST node types, 63 builtin objects) by marginalising across 63,800 controlled prompts, and decompose each circuit into concept‑specific and token‑driven components using contrastive checker prompts that present a keyword token without its associated syntactic structure. Three findings emerge. First, all 106 concepts produce non‑empty universal circuits at every one of nine parameter settings, and the ranking of concept‑specificity across constructs is stable across the sweep ‑ survival is not an artifact of a permissive threshold. Second, AST circuits contain a genuine concept component distinct from token activation: concept‑only neurons constitute up to 62.5% of the loudest‑firing neurons at mid‑to‑late layers, while builtin circuits are almost entirely token‑driven. Third, six computationally atomic constructs ‑ Import, ImportFrom, Break, Continue, Pass, Assert ‑ cluster together despite being semantically unrelated, sharing only the property of being single‑statement constructs requiring no nested body; this atomicity super‑cluster, together with a four‑tier hierarchy organised by token ambiguity and structural distinctiveness, shows that the model's internal organisation tracks computational structure rather than meaning. The methodology, full decomposition data, and analysis code are released.

Authors:Yuki Nakamura
Title: Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol
Abstract:
Comparing a model's internal activations before and after alignment is a natural way to ask what safety training changes: one forms the matrix of paired aligned‑minus‑base activations on safety‑relevant inputs and reads off its effective rank or top direction. We show the obvious way to form this matrix is confounded. The aligned model is evaluated under a chat template the base model never saw, so the naive difference conflates the alignment shift with chat formatting. We introduce a four‑variant decomposition of the modification matrix (naive, template‑controlled, within‑aligned, and difference‑in‑differences, DiD) that separates the two effects. Template control alone removes a 2.0‑3.9x inflation of the measured effective rank across Llama‑3.1‑8B, Gemma‑2‑9B, and Qwen‑2.5‑7B; the DiD contrast is what recovers the refusal direction of Arditi et al. (2024), lifting its cosine alignment from 0.18‑0.39 to 0.50‑0.86. Projection‑ablation across the three families confirms the recovered subspace is behaviorally active and that singular‑value order is not causal order. We validate the protocol on a controlled testbed and distill it into measurement recommendations for activation‑difference studies of alignment.

Authors:Sattam Altuuaim, Lama Ayash, Muhammad Mubashar, Naeemullah Khan
Title: PILOT: Policy-Informed Learned Optimization for Adaptive Deep Network Training
Abstract:
Despite the central role of optimization in deep learning, most optimizers rely on update structures whose functional form is fixed before training begins. This static design can limit their ability to respond to changing gradient behavior across the loss landscape, where training may shift between stable, noisy, and inconsistent regimes. This study proposes PILOT (Policy‑Informed Learned OpTimizer), an online optimizer that adapts its update behavior during training. Rather than using a fixed balance between momentum, normalization, and sign‑based updates, PILOT uses gradient‑direction agreement as a signal of local training stability. Conditioning the update rule on this agreement signal allows the optimizer to adjust its behavior when gradients become stable, noisy, or inconsistent. Experiments on FashionMNIST and CIFAR‑10 show that PILOT consistently achieves the highest accuracy among the evaluated optimizers across convolutional settings. On the CNN architecture, PILOT reaches 94.13% on FashionMNIST and 81.94% on CIFAR‑10. On ResNet‑18, it further improves performance, reaching 95.71% on FashionMNIST and 93.42% on CIFAR‑10. These results suggest that learning how to adapt the update structure during training can improve performance across both compact and deeper convolutional models while preserving a simple first‑order optimization framework. The implementation of PILOT is publicly available at https://github.com/SattamAltwaim/PILOT.git

Authors:Yosef Worku Alemneh, Kidist Amde Mekonnen, Maarten de Rijke
Title: The Multilingual Curse at the Retrieval Layer: Evidence from Amharic
Abstract:
Multilingual retrieval increasingly underpins cross‑lingual question answering and retrieval‑augmented generation. Strong zero‑shot scores on multilingual benchmarks are often taken as evidence that current encoders transfer reliably across many languages. We argue that this assumption breaks down for underrepresented, morphologically rich languages, and use Amharic as a diagnostic case. Under a shared passage retrieval protocol covering dense, late‑interaction, learned sparse, and cross‑encoder paradigms, we compare zero‑shot multilingual retrievers, Amharic‑fine‑tuned multilingual retrievers, and monolingual Amharic retrievers. The strongest zero‑shot multilingual retriever underperforms the strongest monolingual Amharic first‑stage retriever by 23% relative MRR@10. Fine‑tuning two recent multilingual embedding models on the same Amharic supervision yields 32‑60% relative MRR@10 gains over zero‑shot, but the best Amharic‑fine‑tuned multilingual model remains below the strongest monolingual Amharic retriever. These findings indicate that zero‑shot multilingual retrieval is not a sufficient proxy for equitable information access in the LLM era: for underrepresented languages, retrieval must be evaluated and adapted in‑language rather than inferred from aggregate multilingual benchmarks. To foster future research, we publicly release the dataset, codebase, and trained models at https://github.com/rasyosef/amharic‑neural‑ir.

Authors:Zexuan Chen, Sichao Liu, Runhao Lu, Huichao Qi, Alexandra Woolgar, Xi Vincent Wang, Lihui Wang
Title: MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding
Abstract:
Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri‑modal contrastive framework for EEG‑based visual decoding that aligns EEG, visual, and textual representations within a unified latent space. Our approach follows a two‑stage design. First, we pre‑train an EEG encoder via masked reconstruction on unlabeled trials, learning spatio‑temporal regularities that transfer robustly to downstream tasks. Second, we jointly align EEG, image, and LLM‑generated textual descriptions through contrastive learning, where text supervision acts as a semantic regularizer that injects linguistic structure into the shared space without overwhelming the primary EEG‑image signal. The encoder integrates subject‑specific adaptation, graph‑attention over channels, and temporal‑spatial convolutional embeddings. On the Things‑EEG2 200‑way zero‑shot benchmark, our framework achieves 54.1% Top‑1 and 83.4% Top‑5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01) over all in‑subject baselines. We validate generalization on Things‑MEG. Analysis reveals that compact embedding geometries (CN‑CLIP) outperform much larger backbones, and that decoding aligns with established neurophysiology of visual processing. This work is a critical step towards robust, semantically‑grounded visual decoding from non‑invasive temporal neural signals. The source code is publicly available in https://github.com/anon‑eeg/eeg_image_decoding.

Authors:Muhammad Muneeb, David B. Ascher
Title: AnnotateMissense: a genome-wide annotation and benchmarking framework for missense pathogenicity prediction
Abstract:
Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein‑language‑model‑derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome‑wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM‑derived features, conservation metrics, population‑frequency variables, established pathogenicity predictors and engineered amino acid/codon‑context features. Using 132,714 ClinVar‑labelled missense variants, we benchmarked machine‑learning and deep‑learning models under controlled feature configurations. The full 303‑feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC‑AUC = 0.9950 across stratified five‑fold cross‑validation. Restricted naive and location‑oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity‑controlled ablations showed that removing prior‑predictor, population‑frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM‑derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1‑score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.

Authors:Karan Sharma, Aditya Tripathi, Rahul Mishra, Tapas Kumar Maiti
Title: ChainLearn: A Blockchain-Based Capacity-Aware Framework for Federated Ensemble Learning
Abstract:
Federated learning is used in medical imaging where privacy prohibits centralizing data. Standard federated algorithms assume homogeneous hardware, identical architectures, and centralized aggregation, which fails when hospitals have unequal compute resources. We propose capacity‑aware coordination: measure each hospital's throughput, assign capacity‑appropriate architectures (MobileNetV3‑Small, EfficientNet‑B0, ResNet‑50), and combine predictions via weighted ensemble. Weak and strong hospitals can participate without forcing uniform architectures. We separate on‑chain policy from off‑chain learning. A Solidity contract stores hospital registration, benchmark hashes, metrics, and weights. Hospitals train locally and submit only hashes and scalars (not parameters). Weighted ensemble inference is computed off‑chain. Experiments on PneumoniaMNIST and DermaMNIST (5 seeds, 3 non‑IID levels) show our method achieves lower or equal calibration error versus equal‑weight ensemble and competitive accuracy versus FedAvg, FedProx, and FedMD. Communication overhead is 224 bytes per round, a reduction of over 912,000x compared to FedAvg.

Authors:Zherui Yang, Tao Du, Ligang Liu
Title: Learning Laplacian Eigenspace with Mass-Aware Neural Operators on Point Clouds
Abstract:
The eigendecomposition of the Laplace‑‑Beltrami Operator (LBO) is fundamental to geometric analysis, yet computing its low‑frequency eigenmodes remains a significant bottleneck due to the high cost of iterative solvers on large‑scale data. To amortize this cost, we introduce the Neural Eigenspace Operator (NEO), a feed‑forward framework designed to predict the spectrum directly from point clouds. Crucially, NEO circumvents the ill‑posed nature of standard eigenvector regression, which suffers from intrinsic sign flips and rotation ambiguities, by learning the stable, invariant low‑frequency subspace instead. Specifically, the network predicts a redundant set of basis functions whose span robustly covers the target eigenspace, allowing for the recovery of accurate eigenpairs via a lightweight Rayleigh‑‑Ritz refinement. To handle irregular sampling, we propose a mass‑aware neural operator that incorporates per‑point area weights into attention‑based aggregation, improving robustness to non‑uniform densities and enabling zero‑shot generalization across resolutions. Our approach achieves near‑linear runtime scaling and substantial wall‑clock speedups over iterative solvers at comparable accuracy, and exhibits strong zero‑shot transfer to high‑resolution point clouds. The resulting eigenpairs support standard spectral geometry tasks, while the raw basis functions provide effective point‑wise features for downstream learning. Code: https://github.com/Adversarr/NEO.

Authors:Kavin Soni, Debanshu Das, Vamshi Guduguntla
Title: Assessing the Operational Viability of Foundation Models for Time Series Forecasting
Abstract:
Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approaches achieve strong performance, they require domain‑specific training, feature engineering, and ongoing maintenance. Large‑scale foundation models have recently emerged as a zero‑shot alternative, avoiding task‑specific training much like LLMs. In this work, we evaluate foundation models against standard supervised approaches. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human‑centric systems, physically constrained processes, stochastic financial markets, and heterogeneous demand forecasting. Our results characterize optimal deployment areas. Foundation models perform well in domains with transferable periodic structures and are efficient for cold‑start or long‑tail scenarios. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints. In financial domains, newer foundation models are rapidly closing the performance gap with supervised specialists. We further quantify trade‑offs in inference latency, data drift adaptability, and deployment constraints. Finally, we propose a Complexity Router that assigns each series to the optimal model class using empirical features. We demonstrate that this selective routing achieves higher accuracy and significantly lower inference costs compared to deploying a universal foundation model, providing a practical framework for balancing generalization and efficiency.

Authors:Ke Sun, Yizhou Zhao, Jiayi Xin, Qi Long, Weijie Su
Title: CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning
Abstract:
Context or prompt‑level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verified Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle determining what constitutes an optimal weighting remains poorly understood. We address this gap by formulating prompt reweighting as a functional derivative of a utility functional defined in the pass‑rate function space, yielding a unified optimality framework that accommodates existing schemes, including REINFORCE and GRPO. Building on this optimality framework, we propose a distribution‑aware prompt reweighting approach, called CurveRL, based on a quantile coordinate transform, in which the weight assigned to each prompt depends not on the absolute value of pass rates but on its rank and density to reflect the distributional structure of the pass rates in the learning dynamics. Extensive experiments across multiple benchmarks demonstrate that our proposed CurveRL consistently outperforms GRPO and other RLVR baselines. Our study identifies context‑distribution control as a principled axis for analyzing and designing prompt‑reweighted RLVR algorithms. The code is released in https://github.com/zhyzmath/CurveRL.

Authors:Jinghan Jia, Joe Benton, Eric Easley
Title: Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning
Abstract:
Chain‑of‑thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt‑to‑answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information‑flow perspective: faithful reasoning should route answer‑relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt‑to‑answer shortcut. This perspective yields a task‑agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy‑based, masked‑KL, and gradient‑based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low‑entropy failure mode of KL‑based diagnostics where gradient‑based measures remain more stable. Building on this analysis, we introduce update‑time interventions for verifier‑based on‑policy RL, including attention masking, backward‑only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward‑hackable code repair, and DAPO‑Math models trained without hints but evaluated under wrong‑hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward‑hacking behavior more transparent in the CoT and improve task‑agnostic faithfulness metrics, while in some settings also reducing wrong‑hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety‑research/faithful‑cot.

Authors:Bowen Duan, Cong Guo, Chiyue Wei, Haoxuan Shan, Yuzhe Fu, Xinhua Chen, Yifan Xu, Ziyue Zhang, Changchun Zhou, Hai Li, Yiran Chen
Title: EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture
Abstract:
Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which employs compute‑bound GEMM operations, decoding executes a sequence of small GEMV‑like computations that are memory‑bound and underutilize modern accelerators. Weight‑only vector quantization (VQ) has emerged as an effective compression technique that clusters model weights into a shared codebook and replaces the original weight matrix with low‑precision indices, enabling 2‑bit‑level weight compression. While this approach substantially reduces model size and memory bandwidth, it still suffers from two critical inefficiencies: the low utilization of GEMV computation and frequent memory conflicts during codebook lookups. This paper presents EVA, an efficient vector‑quantization‑based architecture that addresses both computational and memory bottlenecks in LLM decoding. EVA builds on a simple yet effective insight that combines input‑codebook computation with conflict‑free memory access. Instead of reconstructing quantized weights from indices, EVA directly performs dot products between input vectors and the weight codebook, transforming LLM decoding from GEMV to GEMM computation. It then performs structured lookups from an intermediate output buffer, eliminating memory bank conflicts. We further design a hardware‑software co‑optimized architecture specialized for LLM decoding while remaining compatible with conventional prefill execution. Evaluations show that EVA achieves up to 11.17× speedup and 7.17× higher energy efficiency compared with the SOTA lookup‑based architecture, while preserving arithmetic precision after vector quantization. Our code is available at https://github.com/dbw6/Eva.git.

Authors:Sanchit Kabra, Nikhil Abhyankar, Saaketh Desai, Prasad Iyer, Chandan K Reddy
Title: LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs
Abstract:
Scientific discovery is a closed‑loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition. To address this, we propose LLM‑AutoSciLab, a closed‑loop framework that couples hypothesis generation with hypothesis‑conditioned experiment selection and mechanism refinement. Rather than fitting models to passively collected data, LLM‑AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence. To evaluate dynamic, closed‑loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: ActiveSciBench‑Chem with 57 enzyme‑kinetics tasks and ActiveSciBench‑GRN with 45 gene‑regulatory‑network tasks. These datasets model discovery as a budget‑constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms. Across NewtonBench, ActiveSciBench‑Chem, and ActiveSciBench‑GRN, LLM‑AutoSciLab outperforms prior methods, achieving 67.6% and 35.1% symbolic accuracy on NewtonBench and ActiveSciBench‑Chem, respectively, and 31.1% exact graph recovery on ActiveSciBench‑GRN. Moreover, hypothesis‑guided experimentation is 2‑5x more sample‑efficient than the strongest competing baselines. Code and data are available at: https://github.com/scientific‑discovery/LLM‑AutoSciLab

Authors:Xiaotian Liu, Shuyuan Shang, Xiaopeng Wang, Pu Ren, Yaoqing Yang
Title: Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation
Abstract:
Neural operators serve as fast, data‑driven surrogates for scientific modeling but typically rely on a monolithic, single‑pass inference procedure that struggles to resolve high‑frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre‑trained operators with a learned refinement module iteratively applied via fixed‑point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high‑frequency errors, we propose a progressive spectral loss that adaptively increases penalty on high‑frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to base operator, the normalized error ratios decrease to 27.72‑36.10% in low‑, 5.07‑6.68% in mid‑, and 1.48‑2.04% in high‑frequencies, remaining stable beyond the trained iteration count. Code is available at https://github.com/xiaotianliu‑dartmouth/Iterative_Refinement_Neural_Operator

Authors:Youwei Pang, Changsheng Gao, Dong Liu, Huchuan Lu, Weisi Lin
Title: Towards Large Model Feature Coding
Abstract:
Large models have delivered remarkable performance across a wide range of perception and generation tasks, yet practical deployment is increasingly constrained by computational and memory budgets, as well as privacy requirements. Split execution alleviates these constraints by partitioning computation across devices, but it inevitably introduces intensive transmission and storage of intermediate features. Unlike conventional feature coding for CNNs that typically targets homogeneous spatial activation maps, modern large models generate heterogeneous features with varying statistical distributions and compression tolerances, e.g., multi‑level/multi‑modal representations and autoregressive context caches. These characteristics necessitate treating large model feature coding (LaMoFC) as a fundamental system component and call for a systematic evaluation framework. In this paper, we present a comprehensive benchmark and evaluation framework for LaMoFC. We first build the feature dataset LaMoFCBench, covering diverse task requirements across 4 categories and 16 scenarios while integrating widelyadopted architectures and various split‑computing settings. We then specify representative split points according to practical application scenarios to extract intermediate features, establishing a unified pipeline for fair and reproducible comparisons. Finally, we benchmark mainstream universal feature codecs, exposing the profound misalignment between existing coding paradigms and the heterogeneous nature of large model features. These findings reveal that LaMoFC demands a fundamental departure from existing paradigms, and LaMoFCBench provides the shared empirical foundation to drive this transition. The data and code will be available at https://github.com/lartpang/LaMoFCBench.

Authors:Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo, Jiabin Liu
Title: Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving
Abstract:
Large language models (LLMs) are promising for autonomous driving, but semantics‑only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason‑‑Imagine‑‑Act (RIA), a closed‑loop framework that couples an LLM reasoner with an action‑conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub‑actions, the world model performs short‑horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point‑goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed‑loop interface, RIA consistently outperforms training‑free baselines, including CARLA TM and MADA, on core closed‑loop metrics. For reproducibility, code is available at https://github.com/pku‑smart‑city/source_code/tree/main/RIA.

Authors:Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz
Title: Nano World Models: A Minimalist Implementation of Future Video Prediction
Abstract:
World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision‑making. Yet, despite rapid progress in industry‑scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action‑conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long‑horizon rollout procedures. This design enables controlled studies of world‑modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real‑robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world‑model research.

Authors:Mingqing Wang, Zhiwei Nie, Athanasios V. Vasilakos, Yonghong He, Zhixiang Ren
Title: Learning Protein Structure-Function Relationships through Knowledge-guided Representation Decomposition
Abstract:
Proteins encode diverse functions within complex three‑dimensional structures, yet most deep learning representations remain highly entangled, obscuring the biophysical signals that underlie function. Here we introduce ProtDiS, a knowledge‑guided framework that decomposes pretrained protein micro‑environment embeddings into biologically grounded and task‑relevant dimensions. Inspired by the information bottleneck principle, ProtDiS learns representations that balance informativeness and compression, yielding structural features that are more specific, independent, and information‑efficient, and achieving consistent improvements across twelve downstream tasks, with the largest gains under structure‑based splits. Protein‑ and residue‑level analyses further show that ProtDiS differentiates proteins with similar folds but divergent functions and captures fine‑grained biophysical signals critical. These findings suggest that knowledge‑guided decomposition provides a general and interpretable approach for structuring latent spaces in protein structural modeling. The source code and implementation details are publicly available at https://github.com/AI‑HPC‑Research‑Team/ProtDiS.

Authors:Zhiyuan Zhai, Xinkai You, Wenjing Yan, Xin Wang
Title: How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning
Abstract:
Reasoning‑capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self‑reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model itself: the redundancy of a correct trace is the largest fraction of its trailing segmented steps that can be truncated while π, forced to terminate thinking and emit a final answer, still produces the correct answer. A large‑scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step‑level redundancy is consistently high ‑‑ between 61% and 93% across the 8 (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight conditions ‑‑ that the finding is robust to the choice of judge family, and that although ρ decreases with problem difficulty on MATH‑500, all four models remain substantially redundant (ρ\in [46%, 85%]) even on the hardest Level‑5 problems. We then prove that this redundancy is a structural consequence of length‑agnostic outcome rewards, not a model‑specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over‑thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained. Code: https://github.com/zhiyuanZhai20/how‑much‑thinking‑is‑enough

Authors:Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski
Title: Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
Abstract:
Visual geometry transformers have become powerful architectures for multi‑view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed‑forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two‑stage framework. First, an inter‑frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra‑frame selection step further discards more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity‑based strategy for inter‑frame selection, which ensures broad coverage of the scene. For intra‑frame selection, we show that layer‑aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed‑accuracy trade‑off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints that how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good‑token‑hunting.github.io.

Authors:Stuart Bladon, Brinnae Bent
Title: It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt
Abstract:
It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre‑training phase. We tested seven open‑weight LLM pairs consisting of the base model (pre‑training only) and the chat model (pre‑training and post‑training) from seven labs on a paired‑scenario forced‑choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post‑training rather than in pre‑training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post‑training. This shift is strongest in Alibaba's Qwen 2.5: while the base is neutral on China‑favourability (‑0.15 log‑odds, p=0.15), the post‑trained chat variant is at +2.91 (p<10^‑4), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French‑made Mistral becomes pro‑France only under French prompting (FR‑EN shift +1.91, p<10^‑4). These findings suggest that geopolitical preferences in language models are not simply inherited from large‑scale internet data but are actively shaped during post‑training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.

Authors:Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang
Title: Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models
Abstract:
Aiming at identifying unexpected inputs from unknown classes, out‑of‑distribution (OOD) detection has emerged as a pivotal approach to enhancing the reliability of machine learning models. This paper focuses on the burgeoning paradigm of post‑hoc OOD detection with pre‑trained vision‑language models (VLMs), where a popular pipeline is to detect OOD inputs by examining their affinities between ID labels and negative labels, i.e., those semantically different from ID labels. Due to the unavailability of target OOD labels, existing works predominantly rely on heuristic rules to mine negative labels from unlabeled wild corpus data. Despite the empirical success, we argue that the power of VLM‑based OOD detection has yet to be fully unleashed since the notorious false negative problem is far from addressed in the literature. With this motivation, we are interested in addressing the challenge of mining true negative labels for OOD scoring. To this end, we develop a theoretical framework for correcting the sampling bias of negatives labels by indirectly approximating the distribution of negative labels. Perhaps surprisingly, we show that the debiased negative mining can be naturally converted into Monte‑Carlo sampling based on ID labels and the unlabeled wild corpus data. Extensive experiments empirically manifest that our method establishes a new state‑of‑the‑art in a variety of OOD detection setups. Code is publicly available at \hrefhttps://github.com/60pen9/Debiased‑Negative‑Mining‑Improves‑OOD‑Detection‑with‑Pre‑trained‑VLMs\textcolorredhere.

Authors:Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima
Title: Relevant Walk Search for Explaining Graph Neural Networks
Abstract:
Graph Neural Networks (GNNs) have become important machine learning tools for graph analysis, and its explainability is crucial for safety, fairness, and robustness. Layer‑wise relevance propagation for GNNs (GNN‑LRP) evaluates the relevance of \emphwalks to reveal important information flows in the network, and provides higher‑order explanations, which have been shown to be superior to the lower‑order, i.e., node‑/edge‑level, explanations. However, identifying relevant walks by GNN‑LRP requires \em exponential computational complexity with respect to the network depth, which we will remedy in this paper. Specifically, we propose \em polynomial‑time algorithms for finding top‑K relevant walks, which drastically reduces the computation and thus increases the applicability of GNN‑LRP to large‑scale problems. Our proposed algorithms are based on the \emphmax‑product algorithm ‑‑ a common tool for finding the maximum likelihood configurations in probabilistic graphical models ‑‑ and can find the most relevant walks exactly at the neuron level and approximately at the node level. Our experiments demonstrate the performance of our algorithms at scale and their utility across application domains, i.e., on epidemiology, molecular, and natural language benchmarks. We provide our codes under \hrefhttps://github.com/xiong‑ping/rel_walk_gnnlrpgithub.com/xiong‑ping/rel\_walk\_gnnlrp.

Authors:Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen, Ke Chen, Yaowei Wang
Title: CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception
Abstract:
High‑resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade‑off between coverage and efficiency. Visual expert‑assisted search is efficient but prone to blind spots when proposals fail, whereas scan‑based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training‑free adaptive framework that dynamically schedules search strategies via an Assess‑then‑Search workflow. Specifically, CVSearch first invokes expert‑assisted search when global information is insufficient, and only triggers a novel semantic‑aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom‑Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state‑of‑the‑art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26‑CVSearch.

Authors:Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue
Title: CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
Abstract:
Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test‑Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground‑Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT‑free TTS, where existing methods directly use self‑generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT‑free, training‑free framework that jointly improves codes and UTs through cooperative self‑play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass‑count signals from the Code‑UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co‑evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output‑consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5‑7B‑Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE‑7B. When applied to CURE‑7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT‑free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.

Authors:Hanadi Alhamdan, Ghadah Alosaimi, Amir Atapour-Abarghouei, Farshad Arvin
Title: CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection
Abstract:
Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their performance in real‑world conditions is often limited by severe data imbalance, large variability between drivers, and the lack of physically interpretable vehicle dynamics representations. In this paper, we propose an enhanced deep learning framework for aggressive driving detection using multivariate vehicle dynamics signals. Instead of relying solely on raw measurements, the proposed approach constructs engineered dynamic features that capture steering, acceleration, and braking behaviour. To address the extreme rarity of aggressive events in naturalistic driving data, we introduce a stable training strategy that combines controlled SMOTE‑based oversampling with a class‑weighted loss formulation, and evaluates focal loss variants for imbalance handling. Furthermore, a safety‑oriented decision strategy based on class‑specific threshold calibration is adopted to better reflect the asymmetric risks of missed detections and false alarms in real‑world applications. The proposed framework is evaluated on a newly collected naturalistic driving dataset. Extensive experiments show that the proposed method consistently outperforms standard deep learning baselines with significant improvements in minority‑class recall and safety‑critical F‑score metrics while maintaining practical computational efficiency. Code: \url https://github.com/halhamdan/CBANet

Authors:Dai Shi, Luke Thompson, Linhan Luo, Lequan Lin, Andi Han, Junbin Gao, José Miguel Hernández Lobato
Title: S$^3$GNN: Efficient Global Mixing and Local Message Passing for Long-Range Graph Learning
Abstract:
Message‑passing neural networks (MPNNs) often suffer from an information bottleneck when capturing long‑range dependencies, leading to the oversquashing (OSQ) phenomenon. Alongside spatial connectivity enrichment (e.g., rewiring), recent studies have shown that spectral filtering can yield strong long‑range learning outcomes, as spectral operators enable global information mixing that alleviates OSQ. These approaches achieve this either by stabilizing the Jacobian energies in deep propagation or by guaranteeing OSQ mitigation under strong theoretical assumptions. We revisit these conclusions and show that the associated Jacobian sensitivity lower bound is generally difficult to achieve in practice. We then propose S^3GNN, which mitigates OSQ without such restrictive assumptions by lightweightly reintroducing omitted components with substantially lower computational complexity, while standard stability constraints on feature transformations remain effective under our new dynamics. Extensive experiments across diverse domains (e.g., long‑range benchmarks, KGQA, and mesh‑based fluid dynamics) demonstrate that S^3GNN achieves up to an order‑of‑magnitude error reduction with up to 50% fewer parameters. Our code can be found in https://github.com/EEthanShi/S3‑GNN.git.

Authors:Hongyi Li, Jun Xu, Hong Yan
Title: Hinge Regression Trees and HRT-Boost: Newton-Optimized Oblique Learning for Compact Tabular Models
Abstract:
Learning high‑quality oblique decision trees remains a significant challenge due to the discrete and non‑convex nature of split optimization. We present the Hinge Regression Tree (HRT) framework, which reframes each oblique split as a nonlinear least‑squares problem over two linear predictors whose max/min envelope induces ReLU‑like representation capacity. We show that the resulting node‑level optimization can be interpreted as a damped Newton method, and we establish the monotonic decrease of the node objective for its backtracking line‑search variant. We establish, theoretically, that HRT is a universal approximator with an explicit O(δ^2) approximation rate. Building upon this base learner, we propose HRT‑Boost, a mathematically synergistic ensemble extension that couples node‑level Newton updates with stage‑wise functional gradient descent. We show that this ensemble construction admits a stage‑wise empirical risk reduction guarantee under the squared loss. Empirical evaluations on synthetic and real‑world benchmarks show that HRT is highly competitive with established single‑tree baselines, and HRT‑Boost compares favorably with strong ensemble baselines and often yields substantially more compact models. The code is publicly available at https://github.com/Hongyi‑Li‑sz/HRT‑Boost.

Authors:Shuai Zhen, Yifan Zhang, Yuling Wang, Yanhua Yu
Title: Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
Abstract:
Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group‑invariant Markov Decision Processes (G‑invariant MDPs). Existing works in this direction have primarily focused on image‑based RL and rotational symmetry such as \mathrmSO(2), leaving state‑based RL and reflection symmetry largely underexplored. In this work, we focus on state‑based continuous control tasks and exploit reflection symmetry by introducing Reflex, a paradigm that seamlessly integrates with both on‑policy and off‑policy RL algorithms. We formalize two types of reflection‑axial reflection and bilateral reflection, and characterize their corresponding transformations. Building on a theoretical analysis of symmetry‑preserving optimal value functions and policies, Reflex integrates reflection symmetry into policy learning through principled symmetry regularization mechanisms. We integrate Reflex with PPO and SAC, and evaluate it on a suite of OpenAI Gym and DeepMind Control benchmarks, demonstrating superior performance over standard baselines while improving sample efficiency. Our code is available at https://github.com/TonyStark042/Reflex.

Authors:Eunwoo Heo, Kyeongkook Seo, Jaejun Yoo
Title: What Linear Probes Miss: Multi-View Probing for Weight-Space Learning
Abstract:
The explosive growth of open‑source model repositories has created a Model Jungle, where checkpoints are frequently shared without adequate documentation or metadata. While weight‑space learning offers a pathway to identify and analyze these models directly from their parameters, processing full‑scale weights is computationally prohibitive. Probing‑based methods have emerged as a lightweight alternative, extracting permutation‑equivariant representations via learnable probe vectors. However, existing probing methods are limited by a single‑view design: they capture first‑order structures but fail to encode the rich, higher‑order correlation patterns inherent in row‑column interactions. To bridge this gap, we introduce MVProbe, a multi‑perspective probing framework that synthesizes first‑order signals with interaction‑aware (Gram‑based) views. Our approach is theoretically grounded; we analyze the scaling laws of different probing orders to derive a principled standardization and fusion strategy that ensures balanced contributions from all branches. On the Model Jungle benchmark, MVProbe consistently outperforms the state‑of‑the‑art ProbeX across diverse architectures, including discriminative backbones (ResNet, SupViT, MAE, DINO) and large‑scale generative LoRA adapters (Stable Diffusion LoRA).

Authors:Jinglin Li, Jun Tan, QI Fang, Ning Gui
Title: Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting
Abstract:
Effectively modeling non‑stationary dynamics in probabilistic multivariate time series(MTS) forecasting requires balancing expressiveness with robustness. Existing parametric approaches benefit from strong inductive biases but lack flexibility, whereas deep generative models struggle to capture complex temporal dependencies without extensive data and computation. We introduce Parametric Prior Mapping (PPM), a framework that injects parametric structural priors into a generative modeling process. Specifically, PPM utilizes a parametric estimator to derive a dynamic, adaptive prior that guides the learning of a complex predictive distribution via a learnable mapping. This design allows the model to retain the efficiency of parametric methods while exploiting the expressive power of generative models. Trained with a hybrid objective, PPM yields precise forecasts with well‑calibrated uncertainty estimates. Empirical results show that PPM outperforms existing baselines in handling non‑stationary data, offering a superior trade‑off between accuracy and computational efficiency. The code is available at https://github.com/ljl8336/PPM.

Authors:Po-Kai Chen, Niki van Stein, Aske Plaat
Title: Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition
Abstract:
Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key‑value template ϕ(S)U. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end‑to‑end paths with K/Q/V composition labels, and per‑token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT‑2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode‑specific routing of each connection (K, Q, or V). To test token‑level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate‑detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground‑truth circuit labels. Code is available at https://github.com/Fun‑Cry/unpacklm.

Authors:Shaoqing Duan, Haofei Song, Xintian Mao, Qingli Li, Yan Wang
Title: Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring
Abstract:
Defocus deblurring in pathological microscopy remains challenging due to the spatially varying and locally discontinuous nature of optical blur induced by a position‑dependent integral imaging process. Existing deep learning methods, constrained by shift‑invariance assumptions and limited interpretability, are not well suited to such heterogeneous blur patterns. Neural operators provide a principled alternative by modeling defocus formation directly as an integral operator, offering a new perspective on defocus deblurring. However, most existing neural operator architectures for low‑level vision rely on globally parameterized kernels that assume smoothness and stationarity, limiting their ability to model heterogeneous and locally discontinuous blur patterns. To address this limitation, we propose the Discontinuous Galerkin Neural Operator (DGNO), which parameterizes the integral kernel using a discontinuous Galerkin formulation with element‑local volume operators and interface numerical fluxes. DGNO provides a principled combination of locality, heterogeneity modeling, and global coherence while preserving the underlying physics of optical image formation. Extensive and insightful experiments demonstrate that DGNO surpasses state‑of‑the‑arts, delivering sharper reconstructions, robust handling of spatially varying blur, and scalable high‑resolution performance. The code will be released at https://github.com/DeepMed‑Lab‑ECNU/Single‑Image‑Deblur.

Authors:Woohyun Lee, Hogun Park
Title: Self-supervised Adversarial Purification for Graph Neural Networks
Abstract:
Defending Graph Neural Networks (GNNs) against adversarial attacks requires balancing accuracy and robustness, a trade‑off often mishandled by traditional methods like adversarial training that intertwine these conflicting objectives within a single classifier. To overcome this limitation, we propose a self‑supervised adversarial purification framework. We separate robustness from the classifier by introducing a dedicated purifier, which cleanses the input data before classification. In contrast to prior adversarial purification methods, we propose GPR‑GAE, a novel graph auto‑encoder (GAE), as a specialized purifier trained with a self‑supervised strategy, adapting to diverse graph structures in a data‑driven manner. Utilizing multiple Generalized PageRank (GPR) filters, GPR‑GAE captures diverse structural representations for robust and effective purification. Our multi‑step purification process further facilitates GPR‑GAE to achieve precise graph recovery and robust defense against structural perturbations. Experiments across diverse datasets and attack scenarios demonstrate the state‑of‑the‑art robustness of GPR‑GAE, showcasing it as an independent plug‑and‑play purifier for GNN classifiers. Our code can be found at https://github.com/woodavid31/GPR‑GAE.

Authors:Minju Kim, Youngbum Hur
Title: PaP-NF: Probabilistic Long-Term Time Series Forecasting via Prefix-as-Prompt Reprogramming and Normalizing Flows
Abstract:
Time series forecasting plays a central role in many real‑world applications and has been extensively studied. Most existing approaches rely on deterministic models. However, real‑world environments exhibit inherently uncertain and complex future behaviors, making single‑point predictions insufficient. This highlights the need for probabilistic forecasting methods that can quantify and represent uncertainty. In this work, we propose PaP‑NF, a probabilistic forecasting framework that aligns continuous time series representations with a frozen large language model (LLM) using a Prefix‑as‑Prompt mechanism, and conditions a normalizing flow decoder on the global context extracted by the LLM. The quality of the resulting predictive distributions is evaluated using the Continuous Ranked Probability Score (CRPS), a standard metric in probabilistic forecasting. Across a variety of long‑term forecasting benchmarks, PaP‑NF robustly captures multi‑modal uncertainty while maintaining competitive point forecasting accuracy. The official implementation is available at: https://github.com/democracy04/PaP‑NF

Authors:Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari
Title: FastKernels: Benchmarking GPU Kernel Generation in Production
Abstract:
LLM‑based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation‑stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production‑grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under‑served architectures; each task's interface mirrors the corresponding module in the state‑of‑the‑art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state‑of‑the‑art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53× ‑‑ confirming that benchmark‑production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake‑AI‑Research/fastkernels

Authors:Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio
Title: Lipschitz Optimization for Formal Verification of Homographies
Abstract:
The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety‑critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to \ell_p‑norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploy many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed‑form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image‑formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN‑COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real‑world case study on a safety‑critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography‑verification .

Authors:Guoming Li, Shangyu Zhang, Junwei Pan, Wentao Ning, Jin Chen, Gengsheng Xue, Chao Zhou, Shudong Huang, Haijie Gu, Menglin Yang
Title: Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation
Abstract:
Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per‑token feedforward networks (P‑FFNs) to achieve scalable performance. However, RankMixer suffers from embedding collapse, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P‑FFN modules as the primary causes of this phenomenon, jointly inducing a damped oscillatory trajectory in effective‑rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum‑robust representations with provable collapse mitigation. RankElastor introduces two components: (i) parameterized full mixing, which enables expressive token mixing with improved spectral robustness; and (ii) GLU‑improved P‑FFNs, which stabilize representation spectra through GLU‑style FFN modules. Extensive experiments on large‑scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: https://github.com/vasile‑paskardlgm/RankElastor

Authors:Jaehyeop Hong, Youngbum Hur
Title: CALAD: Channel-Aware contrastive Learning for multivariate time series Anomaly Detection
Abstract:
Multivariate time series anomaly detection has become increasingly important in real‑world applications, where labeled data are often scarce. Many existing approaches rely on unsupervised learning to model normal patterns, but they often treat all channels equally. This design can dilute anomaly‑relevant signals, since not all channels contribute equally to anomaly detection. In this paper, we propose CALAD, a channel‑aware contrastive learning framework for multivariate time series anomaly detection. CALAD governs the construction of contrastive samples using estimated channel relevance, allowing the learning process to reflect anomaly semantics rather than generic similarity. Channel relevance is estimated from reconstruction errors of a transformer‑based autoencoder and is used to distinguish channels that are more influential to anomalous behaviors. Using this information, we design a channel‑wise augmentation strategy in which positive and negative samples are constructed based on whether anomaly‑relevant channels are preserved or perturbed. This encourages invariance to changes in irrelevant channels while being sensitive to changes in anomaly‑relevant channels. Furthermore, CALAD combines contrastive learning and an auxiliary reconstruction head, allowing the model to learn discriminative representations while retaining normal structures. Experiments on multiple real‑world datasets shows that CALAD consistently outperforms existing methods, particularly under distribution shift scenarios. We provide the code for reproducibility at https://github.com/hirundo1218/CALAD

Authors:Yannick Kirchhoff, Maximilian Rokuss, Daniel Philipp Mertens, David Füller, Benjamin Hamm, Andreas Schreyer, Oliver Ritter, Klaus Maier-Hein
Title: Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking
Abstract:
Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade‑off: end‑to‑end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration‑segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration‑proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally‑informed segmentation. To address data scarcity, we leverage large‑scale synthetic pretraining, proving essential for exploiting longitudinal context, improving performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out‑of‑distribution generalization. Experiments show that our model outperforms prior work in both fully automatic and the proposed verified tracking setting offering a clinically safe middle ground between automation and control. Code, model and dataset will be released at https://github.com/MIC‑DKFZ/LongiSeg

Authors:Joe Sharratt
Title: ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Abstract:
Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long‑context workloads. Prior work utilises block‑scaled quantisation techniques on Blackwell GPUs to move attention computation to 4‑bit precision to accelerate inference. However, these techniques result in significant quality degradation in long‑context settings. We show that the output impact of quantisation error is highly non‑uniform and increases with the importance of each query‑key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low‑bit attention variant that delivers near‑FP16 long‑context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query‑key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long‑context benchmarks and model families that by computing only 5% of query‑key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4‑to‑FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.

Authors:Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu
Title: GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
Abstract:
Mixture‑of‑Experts Large Language Models (MoE‑LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed‑precision quantization mitigates this cost by allocating expert‑wise bit‑widths based on their importance, approaching the accuracy‑memory Pareto frontier and enabling extreme low‑bit quantization. However, existing methods rely on layer‑wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert‑level Mixed‑precision Quantization (GEMQ) to overcome these limitations via (1) a global linear‑programming formulation that captures model‑wide expert importance based on quantization error analysis, and (2) efficient router fine‑tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .

Authors:Colin Gaffney, Shutong Li, Daniel Ng, Anastasia Petrushkina, Niket Kumar, Adam Cogdell, Mridul Sahu, Yaning Liang, Nikhil Bansal, Justin Pan, Angel Mau, Abhishek Agrawal, Marco Berlot, Ruoxin Sang, Kiranbir Sodhia, Rakesh Iyer
Title: Orbax: Distributed Checkpointing with JAX
Abstract:
In a landscape of high‑performance distributed ML systems, JAX has emerged as a framework of choice. However, JAX's modular design philosophy leaves it without a standardized checkpointing solution. In this paper, we introduce Orbax, a modular, JAX‑native checkpointing library that abstracts the complexities of distributed accelerator systems while also providing flexibility for user‑friendly checkpoint manipulations throughout the ML model lifecycle. We demonstrate performance exceeding comparable PyTorch competitors by up to 3.5× for saving and 2× for loading. The library is available at https://github.com/google/orbax.

Authors:Simone Antonelli, Sadegh Akhondzadeh, Aleksandar Bojchevski
Title: Test-Time Training Undermines Safety Guardrails
Abstract:
Test‑Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few‑shot learning, retrieval‑augmented generation, and complex reasoning. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models. We identify three threat models for TTT and demonstrate how attackers can leverage them to bypass safety filters. Our results show that TTT can significantly increase the Attack Success Rate (ASR) and the ASR over 10 generation trials (ASR@10). For example, under LoRA, the few‑shot and generation‑phase threat models achieve an average ASR@10 of 95% and 93% respectively, across models from different families and scales. These vulnerabilities transfer to production fine‑tuning APIs. We also show that TTT‑induced overfitting can produce degenerate outputs that inflate ASR under standard judges, and propose a validity‑aware evaluation to correct for this. Our findings suggest that TTT exposes a new attack surface, strengthens attacks, and undermines existing safety guardrails. As a first step toward defense, we propose a lightweight provider‑side detector that flags TTT requests via the perplexity shift on a private harmful holdout, but robust deployment will ultimately require dynamic alignment.

Authors:Benjamin Rozonoyer, Jacopo Minniti, Dhruvesh Patel, Neil Band, Avishek Joey Bose, Tim G. J. Rudner, Andrew McCallum
Title: Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
Abstract:
When Masked Diffusion Models (MDMs) generate sequences through iterative refinement, the rich internal computation over masked positions is discarded, forcing every subsequent refinement step to recompute the valuable internal information stored as model representations. To avoid a hard reset between denoising rounds, we propose Learned Relay Representations (Relay), a method that allows MDMs to be forward‑thinking when denoising by explicitly learning how to propagate latent information for the benefit of future denoising steps. Relay introduces a differentiable per‑token channel that passes information between forward passes and is trained via truncated backpropagation through time (BPTT). We show that this framework can be scaled to state‑of‑the‑art Diffusion Language Models (DLMs), and is seamlessly compatible with techniques like block diffusion and KV caching. We first provide a thorough justification of the design choices in Relay on a challenging Sudoku‑based planning task. We then scale Relay to Fast‑dLLM v2, a state‑of‑the‑art DLM, outperforming standard supervised finetuning on coding tasks while reducing inference latency by up to 32%. Our empirical results demonstrate that state‑of‑the‑art DLMs can be explicitly trained to relay latent information forward across decoding steps, advancing the performance‑latency Pareto frontier. We provide code for all our experiments.

Authors:Shubham Parashar, Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji
Title: Learnability-Informed Fine-Tuning of Diffusion Language Models
Abstract:
We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post‑training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT‑based post‑training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.

Authors:Jingyan Zhang, Han Liang, Ruichi Zhang, Bin Li, Juze Zhang, Xin Chen, Jingya Wang, Lan Xu, Jingyi Yu
Title: SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control
Abstract:
Controlling physics‑based humanoids from natural‑language instructions is a critical step toward general‑purpose embodied agents. However, existing methods remain constrained by a tension between semantic expressiveness and physical feasibility, often failing to jointly achieve faithful instruction following, high‑quality motion, and stable long‑horizon control. We propose SCRIPT, a scalable diffusion policy with a multi‑stage training framework for language‑driven physics‑based humanoid control. The core of SCRIPT is a Joint Action‑State‑Text Diffusion Transformer (JAST‑DiT), which represents actions, physical states, and text as dedicated token streams and couples them through joint attention, enabling direct interaction between language semantics and control dynamics. To stabilize autoregressive control, we introduce a nonlinear history conditioning mechanism, which preserves the dense recent context and samples increasingly sparse cues from long‑term history. Beyond supervised imitation pre‑training, we propose a post‑training stage, further improving the performance using Reinforcement Learning with Hybrid Rewards (RLHR). By injecting learnable noise into the flow‑sampling process, RLHR effectively improves motion quality and instruction following within closed‑loop simulations using hybrid physical feedback and text rewards. Quantitative evaluations demonstrate that SCRIPT outperforms prior state‑of‑the‑art methods, with gains across text alignment, motion quality, and physical realism metrics. Furthermore, scaling studies on the 1200‑hour MotionMillion dataset demonstrate consistent performance gains with model scaling, highlighting SCRIPT's robust scalability for large‑scale pre‑training. Our code will be publicly available for future research.

Authors:Yequan Zhao, Ruijie Zhang, Liyan Tan, Niall Moran, Tong Qin, Zheng Zhang
Title: FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning
Abstract:
Both full fine‑tuning (Full FT) and parameter‑efficient fine‑tuning methods such as LoRA introduce weight updates without accounting for the spectral structure established during pretraining. As a result, noisy gradients from limited fine‑tuning data can perturb robust pretrained features. We identify spectral preconditioning as the missing ingredient: reparameterizing each weight matrix through its full‑rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme that outperforms unconstrained Full FT at the same trainable parameter count. Building on this insight, we propose FuRA (Full‑Rank Adaptation), an efficient full‑rank adaptation framework based on a block tensor‑train factorization W = LSR, where the large core L is fixed to the pretrained block‑wise SVD basis, while only the compact core R and the block‑wise singular values S are optimized. This design simultaneously provides full‑rank spectral preconditioning, preserves full‑rank update expressivity, and achieves parameter, memory, and step‑time efficiency comparable to LoRA. FuRA consistently outperforms Full FT across multiple settings, including LLM fine‑tuning (+1.37 on LLaMA‑3‑8B commonsense reasoning), LLM reinforcement learning for mathematical reasoning, and visual instruction tuning for VLMs. Furthermore, the 4‑bit quantized variant, QFuRA, also surpasses QLoRA. Code is available at https://github.com/olokevin/FuRA‑NIPS

Authors:Yingjie Lei
Title: PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations
Abstract:
Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator‑based benchmark for hidden‑preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle‑customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter‑offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM‑facing state‑summary protocol that constrains agents to return strict JSON actions under a fixed hidden‑information boundary. We evaluate zero‑shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller‑profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement‑seeking behavior can coexist with weak profit‑sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing‑agent behavior under hidden buyer preferences.

Authors:Lily Goli, Justin Kerr, Daniele Reda, Alec Jacobson, Andrea Tagliasacchi, Angjoo Kanazawa
Title: Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration
Abstract:
Exploration is a prerequisite for learning useful behaviors in sparse‑reward, long‑horizon tasks, particularly within 3D environments. Curiosity‑driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL‑based active mapping baselines and generalizes zero‑shot to Gibson and AI‑generated worlds. Our end‑to‑end policy enables efficient adaptation to downstream tasks, such as apple picking and image‑goal navigation, outperforming from‑scratch baselines. Please see video results at https://recuriosity.github.io/.

Authors:Vishal Rajput
Title: The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning
Abstract:
Robustness, domain adaptation, photometric and occlusion invariance, compositional generalisation, temporal robustness, alignment safety, and classical anisotropic regularisation are usually treated as separate problems with separate method families. This paper argues that much of their shared structure is one statistical problem: estimate the covariance of label‑preserving deployment nuisance, then regularise the encoder Jacobian along a matrix whose range covers that covariance (the matching principle). CORAL, adversarial training, IRM, augmentation, metric learning, Jacobian penalties, and alignment‑style constraints are different estimators of that object, not independent robustness tricks. In the linear‑Gaussian model we prove closed‑form optimality (Theorem A), including cube‑root water‑filling within the matched range; necessity of range coverage for quadratic Jacobian penalties (Theorem G); the same range dichotomy at deep global minima; and two falsification controls (Lemma C; Corollaries E), with seven conditional consistency lemmas (D1‑D7) for estimation under standard identifiability assumptions. We introduce the Trajectory Deviation Index (TDI), a label‑free probe of embedding sensitivity when task accuracy or Jacobian Frobenius norm is insufficient. Thirteen pre‑registered blocks from classical ML through Qwen2.5‑7B test the predicted matched, then isotropic, then wrong‑W ordering on geometry and deployment drift; twelve pass, and the sole exception (Office‑31) is an eigengap failure named before the run. At 7B scale, matched style‑PMH improves selective honesty and preserves Style TDI where standard DPO degrades it. The contribution is naming the deployment nuisance covariance, stating what the regulariser must do, and supplying a closed‑form falsifiable theory once that object is identified, not universality on every leaderboard.

Authors:Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Wei Xue, Jun Song, Xinmei Tian, Yike Guo
Title: MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
Abstract:
Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human‑driven update ships a fix. Self‑evolving agents have emerged in response, but all confine evolution to text‑mutable artifacts ‑‑ skill files, prompt configurations, memory schemas, workflow graphs ‑‑ and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source‑level adaptation is a fundamentally more general medium: it is Turing‑complete, a strict superset of every text‑mutable scope, takes effect deterministically rather than through base‑model compliance, and does not erode under long‑context drift. We present MOSS, a system that performs self‑rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production‑failure evidence and proceeds through a deterministic multi‑stage pipeline; code modification is delegated to a pluggable external coding‑agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user‑consent‑gated, in‑place container swap with health‑probe‑gated rollback. On OpenClaw, MOSS lifts a four‑task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

Authors:Samson Gourevitch, Yazid Janati, Dario Shariatian, Umut Simsekli, Eric Moulines, Eric P. Xing, Alain Durmus
Title: Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation
Abstract:
Discrete diffusion models are often trained through clean‑data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug‑in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave‑one‑out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug‑in ELBO and the usual cross‑entropy denoising objective. We characterize the leave‑one‑out target and derive exact conversions between the denoiser, the leave‑one‑out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor‑corrector sampler and improved temperature sampling based on the leave‑one‑out predictor. We further introduce an absorbing‑state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked‑diffusion‑like sampling operations, with simpler denoising posteriors, carry‑over unmasking, and a natural remasking mechanism. On language modeling, leave‑one‑out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at https://github.com/samsongourevitch/rev_udm.

Authors:Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo, Reza Shokri
Title: The Distillation Game: Adaptive Attacks & Efficient Defenses
Abstract:
Distillation attacks create a deployment trade‑off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade‑off through a minimax game between a utility‑constrained teacher and an adaptive student. Our framework yields tractable one‑sided response rules: an adaptive evaluation rule in which the student reweights high‑value examples, and a teacher‑side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product‑of‑Experts (PoE), a simple forward‑pass‑only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive‑‑adaptive gap: on state‑of‑the‑art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher‑quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation‑game.

Authors:Sid-ali Temkit
Title: AMEL: Accumulated Message Effects on LLM Judgments
Abstract:
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open‑source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = ‑0.17, p < 10^‑46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = ‑0.34 for high‑entropy items, vs d = ‑0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^‑39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku ‑0.22 to Opus ‑0.17; OpenAI: Nano ‑0.34 to GPT‑5.2 ‑0.17). Three follow‑ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token‑level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50‑turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.

Authors:Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang
Title: SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
Abstract:
While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end‑to‑end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post‑hoc step. To bridge this interpretability gap, we propose SegCompass, an end‑to‑end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image‑instruction pair, SegCompass first generates a chain‑of‑thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high‑dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi‑slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE‑driven interface provides a "white‑box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state‑of‑the‑art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at https://github.com/ZhenyuLU‑Heliodore/SegCompass.

Authors:Víctor Yeste, Paolo Rosso
Title: More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts
Abstract:
Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine‑grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence‑level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full‑document inputs; no‑RAG and retrieval‑augmented settings with a curated moral knowledge base; supervised DeBERTa‑v3‑base/large encoders; and zero‑shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full‑document context improves supervised DeBERTa encoders by 3.8‑4.8 macro‑F1 points over sentence‑only input, but does not consistently help zero‑shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa‑v3‑base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late‑fusion and cross‑attention RAG variants for encoders. Per‑value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value‑sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

Authors:Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo
Title: The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
Abstract:
While multi‑task learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most focus on architectural designs yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone‑agnostic optimizer named Conflict‑Averse Magnitude‑Enhanced Gradient Descent (CAME‑Grad). Through conflict‑averse direction rectification and magnitude‑enhanced energy injection, the algorithm not only ensures geometric validity, but also avoids local optimal solutions. Then, the adaptive gradient fusion mechanism is used to establish a dynamic balance between the theoretical optimal direction and the task‑specific inductive bias. Experiments show that as a universal plug‑and‑play optimizer, CAME‑Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC‑CXR and 1.9% on IU X‑Ray. Our code is available at https://github.com/vpsg‑research/CAME‑Grad.

Authors:Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu
Title: MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy
Abstract:
Learning real‑world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real‑world objects typically exhibit mild anisotropy and heterogeneity. After the near‑isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real‑to‑sim gap. Although neural networks can fit dynamics end‑to‑end, such black‑box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion‑constrained stress adaptation framework that targets these residual effects to further improve real‑to‑sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane‑constrained redistribution in a physics‑informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real‑to‑sim dynamics modeling translates into more reliable sim‑to‑real transfer. Project Page is available at https://mercerai.github.io/MoSA/.

Authors:Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor
Title: SceneAligner: 3D-Grounded Floorplan Localization in the Wild
Abstract:
Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small‑scale environments and precise vectorized floorplans, limiting their ability to operate in large‑scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity‑aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross‑modal correspondences, introducing a fine‑tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.

Authors:Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti
Title: SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
Abstract:
Today, tool‑calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre‑deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi‑turn, tool‑calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine‑grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi‑axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae‑2026‑synae‑demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.

Authors:Lucas Sheneman
Title: The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning
Abstract:
Scientific machine learning often requires combining known physics with unknown parameters or correction terms learned from data. Existing approaches either ignore known structure, encode it as a soft penalty, or require hand‑written PyTorch code for each equation. We present The Neural Compiler, a system that translates programs written in a first‑order Scheme‑like expression language into frozen, differentiable PyTorch modules. These modules match the source program to floating‑point precision and provide gradients through autograd. In hybrid models, the compiled module encodes known physics exactly while learned components model the unknown remainder. We evaluate the compiler across six experiment domains: Feynman physics equations, Lotka‑Volterra dynamics, a damped pendulum, a one‑dimensional heat equation, three‑dimensional vector mechanics, and compositional generalization. Compiled modules match hand‑coded PyTorch implementations numerically for single equations, showing no accuracy loss from compilation. With only 1 to 4 trainable parameters, compiled models recover physical constants to less than 1 percent error in most cases, while standard PINN baselines with more than 8500 parameters show 7 to 93 percent error. Compiled modules also compose with zero error, while neural approximations can accumulate large errors in deep composition chains. The main value of the compiler is not improved accuracy over hand‑coded equations, but systematic composability: it generates correct, differentiable modules from symbolic specifications without rewriting each equation by hand. The system supports 51 primitive operations, including vector and matrix algebra, enabling PDE discretizations and hybrid scientific models. This string‑in, module‑out interface also provides a natural target for large language models that translate scientific descriptions into executable differentiable modules.

Authors:An Xuan Nguyen
Title: Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference
Abstract:
Hybrid language models like Jamba mix attention layers with State Space Models (SSMs), creating two memory cache types with opposite profiles: Key‑Value (KV) caches grow linearly with sequence length, while SSM states stay fixed per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging (AVMP). The allocator separates the two cache types into physically distinct pools behind a unified virtual address space, and migrates capacity between pools when one runs out. Migration triggers only on allocation failure, keeping behavior deterministic. We evaluate AVMP across 270 synthetic cells plus 60 cells of ShareGPT trace replay on an RTX 3060 12GB. Out‑of‑Memory events drop 7.6% and request throughput improves 1.83x to 13.3x across synthetic workloads and 2.36x on ShareGPT. All gains hold under paired‑bootstrap 95% confidence intervals. A phase‑time breakdown reveals two distinct mechanisms: shorter OOM recovery on capacity‑pressured workloads, and faster allocation calls on KV‑heavy workloads. Implementation is pure Python; Triton integration is future work.

Authors:Yifan Bai, Xiaoyang Liu, Zihao Mou, Guihong Wang, Jian Yu, Shuhan Xie, Yantao Li, Yangyu Zhang, Jingwei Liang, Tao Luo
Title: VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation
Abstract:
As large language models (LLMs) are increasingly deployed for software engineering, constructing high‑quality benchmarks is crucial for evaluating not just the functional correctness, but also the formal verifiability of generated code. However, existing benchmarks are limited by the quantity and quality of positive and negative test cases, leading to an overestimation of model capabilities in generating specifications and implementations. To address this, we propose VeriScale, a novel framework driven by the adversarial implementations. It consists of two stages: test‑suite expansion to construct diverse and challenging test cases, and test‑suite reduction to distill them into compact yet discriminative suites. While VeriScale is general, we instantiate it on Verina to construct VerinaPlus, which expands the original test suites by over 83×, and VerinaLite, a lightweight 14× variant. Our experiments across eight state‑of‑the‑art LLMs demonstrate that VerinaPlus exposes substantial model weaknesses hidden by the original benchmark, evidenced by sharp score drops on both SpecGen and CodeGen tasks, whereas VerinaLite maintains this discriminative power at a fraction of the evaluation cost. The enhanced benchmarks and source code are publicly available at https://github.com/XiaoyangLiu‑sjtu/VeriScale.

Authors:Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu
Title: TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
Abstract:
Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large‑scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre‑training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end‑to‑end, map‑free route generation directly from origin‑destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD‑ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.

Authors:Fabian Morelli, Stephan Eckstein
Title: Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation
Abstract:
Ensembles of neural networks typically outperform individual networks but incur large computational costs, whereas weight aggregation produces less costly, yet also less accurate, aggregate models. We introduce partial fusion of networks, which interpolates between ensembles and weight aggregation and thus allows for a flexible tradeoff between computational cost and performance. A direct way to achieve this is to extend existing weight aggregation methods based on neuron‑level similarity between different networks, where partial fusion then only aggregates weights of neurons which are most similar. We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport. Further, we consider the more general perspective of weight aggregation and partial fusion as generalized pruning of ensemble models, where neurons cannot just be deleted, but also linearly combined. Finally, we show that generalized pruning applied to a single network yields similar benefits as partial fusion by allowing for a tradeoff between isolating, deleting, and linearly combining neurons based on similarity. Our code is available at https://github.com/Fabian‑Mor/partial_fusion_nn.

Authors:Santiago Ospitia, John Sanabria, John Garcia-Henao
Title: SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection
Abstract:
Despite strong predictive results in the clinical machine learning literature, the translation of these models into bedside use remains limited by systems‑level barriers: heterogeneous data representations, the absence of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency requirements of hospital environments. We present the SepsisAI‑Orchestrator, an open‑source modular platform that addresses this deployment gap for early sepsis detection. The platform integrates HL7 FHIR‑inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 0.87‑0.94 on PhysioNet 2019) is reused without modification; the contribution lies in the surrounding infrastructure and its empirical characterization under load. Using k6 with 50‑1000 concurrent virtual users, we find that replica count must be matched to the physical CPU thread count of the host: scaling from 3 to 12 replicas on a 12‑thread CPU reduces p95 latency from 3.3s to 1.41s (57.3% reduction) and eliminates all request failures, while over‑provisioning to 24 or 48 replicas degrades performance due to scheduler contention. To our knowledge this U‑shaped scaling behavior has not been quantified previously for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at https://github.com/nucleusai/sepsisai‑orchestrator.

Authors:Di He, Songjun Tu, Keyu Wang, Lu Yin, Shiwei Liu
Title: One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Abstract:
Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy‑Tailed Self‑Regularization (HT‑SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy‑tailedness. Layers with weaker heavy‑tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy‑tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT‑nano), optimizers (AdamW and Muon), and parameter scales (60M‑1B) demonstrate that LLR achieves up to 1.5x training speedup and outperforms baselines, notably raising average zero‑shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Code is available at https://github.com/hed‑ucas/Layer‑wise‑Learning‑Rate.

Authors:Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao
Title: Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Abstract:
The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert‑Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)‑driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision‑making process over a hierarchical model‑skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two‑tier skill library, deciding at each step whether to invoke an external expert, which model‑skill pair to select, and when to terminate. The policy is optimized via outcome‑based RL, requiring no step‑level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high‑resolution perception, and domain‑specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT‑5 (69.3%) and Gemini‑2.5‑Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out‑of‑domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed‑source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.

Authors:Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing
Title: Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
Abstract:
How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain‑of‑thought), trained end‑to‑end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision‑making into three systems: simulative reasoning (System II) grounding deliberation in future‑state prediction via a world model; self‑regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine‑grained action. Simulative reasoning provides unified planning across diverse tasks without per‑domain engineering, while self‑regulation ensures the planner is invoked only when needed. To test this, we develop SR^2AM (Self‑Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain‑of‑thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi‑module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1‑8B and v1.0‑30B achieve Pass@1 competitive with 120‑355B and 685B‑1T parameter systems respectively, while v1.0‑30B uses 25.8‑95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self‑regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

Authors:Hyeseong Kim, Geonhui Son, Deukhee Lee, Dosik Hwang
Title: TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
Abstract:
Novel view synthesis from sparse‑view inputs poses a significant challenge in 3D computer vision, particularly for achieving high‑quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non‑rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control‑point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip‑NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse‑view scenarios.

Authors:Hongsin Lee, Hye Won Chung
Title: Toward Understanding Adversarial Distillation: Why Robust Teachers Fail
Abstract:
Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher's soft labels within the min‑max adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student's robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data ‑‑ the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two‑layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal, thereby driving robust overfitting. Conversely, a teacher that exhibits high uncertainty on these samples effectively suppresses noise memorization, allowing the student to rely solely on the learnable signal for robust generalization. We empirically validate our theory across both synthetic simulations and real‑image classification datasets, confirming that robust overfitting is driven by the teacher's interaction with unlearnable samples. Finally, we demonstrate that a teacher's predictive entropy on unlearnable samples serves as a strong indicator of student robustness, validating our theoretical framework and offering a principled guideline for robust teacher selection.

Authors:Yifan Lan, Yuanpu Cao, Hanyu Wang, Lu Lin, Jinghui Chen
Title: The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation
Abstract:
Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero‑CoT Probe (ZCP), a novel black‑box detection method that deliberately truncates the entire Chain‑of‑Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model's intrinsic problem‑solving capabilities, ZCP compares the model's zero‑CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine‑tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at https://github.com/Yifan‑Lan/zero‑cot‑probe.

Authors:Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso
Title: EntmaxKV: Support-Aware Decoding for Entmax Attention
Abstract:
Long‑context decoding is increasingly limited by KV‑cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, α‑entmax produces exact zeros, turning sparse decoding from dense‑tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax‑native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query‑aware page scoring, support‑aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass δ, showing that output error is controlled by δ and vanishes when the entmax support is recovered. We further introduce a Gaussian‑aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax‑based sparse decoding at matched KV budgets. On long‑context and language modeling benchmarks, it closely matches full‑cache entmax while using a small fraction of the KV cache, achieving up to 3.36× (softmax) and 5.43× (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep‑spin/entmaxkv.

Authors:Mansoor Ahmed, Murray Patterson
Title: AgForce Enables Antigen-conditioned Generative Antibody Design
Abstract:
Antibody design methods condition on antigen structure to generate complementarity‑determining regions (CDR), yet a systematic evaluation of baseline methods reveals that they largely ignore the antigen input. We identify three failure modes that explain this behavior. Antigen blindness arises because models derive predictions from antibody framework context rather than antigen information, producing nearly identical CDRs regardless of the target. Vocabulary collapse reduces predicted amino acids to three to five per position, far below the ground truth distribution in native sequences. Moreover, any model trained with standard per‑position cross‑entropy converges to the positional marginal distribution, making it provably unable to produce antigen‑specific sequence predictions. We propose a novel encoder‑decoder architecture called AgForce, that uses a graph neural network (GNN) as the encoder and specialized decoders for sequence‑structure co‑design. Specifically, we apply framework dropout, gated bottlenecks, and hyperbolic cross attention that prevent the antibody shortcut path. In the decoder, a Mixture Density Network (MDN) sequence head with Potts‑like pairwise coupling and annealed Multiple Choice Learning (aMCL) replaces the cross‑entropy objective with a multi‑component distribution whose optimal solution differs from the positional marginal. An antigen cycle consistency head routes gradients through the sequence decoder, forcing predicted distributions to encode antigen identity. AgForce achieves the best binding quality and sequence recovery simultaneously on the CHIMERA‑Bench dataset, improving amino acid recovery by 8% over the strongest sequence baseline while surpassing the baselines across all interface metrics, and nearly doubling the effective vocabulary of GNN methods. The source code is available at: https://github.com/mansoor181/ag‑force.git

Authors:Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao
Title: When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
Abstract:
On‑policy self‑distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student‑visited prefix. Existing entropy‑based OPD methods relax this uniformity by modulating token‑level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non‑viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch‑viability diagnostic. Specifically, we record next‑token alternatives from the privileged‑answer teacher prompt, force each alternative after the student prompt plus its on‑policy spine prefix, and test whether the resulting student‑template continuation recovers the correct answer. On Qwen3‑4B, we find that an oriented within‑sequence position score is the strongest tested predictor of teacher‑token reliability, reaching an area‑under‑ROC‑curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory‑level structure, we propose Position‑Weighted On‑Policy Self‑Distillation (PW‑OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward‑KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic‑derived PW‑OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger‑scale models from different families, DeepSeek‑R1‑Distill‑Llama‑8B and Olmo‑3‑7B‑Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher‑token reliability in reasoning distillation is trajectory‑structured and can be utilized without additional teacher computation.

Authors:Xuyang Zhong, Qizhang Li, Yiwen Guo, Chen Liu
Title: DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models
Abstract:
We propose DualOptim+, a novel optimization framework for improving machine unlearning in large language models. It introduces a base state to capture common representations shared by forgetting and retaining objectives and delta states to preserve objective‑specific residuals. This architecture allows the optimizer to adaptively bridge shared and decoupled states based on the directional conflict between forgetting and retaining gradients. We further introduce DualOptim+ 8bit, a quantized variant that reduces memory overhead without compromising performance. Extensive experiments across fictitious and real‑world unlearning, safety alignment, and multi‑task learning tasks demonstrate that DualOptim+ consistently achieves a superior trade‑off between different objectives. Codes are available at https://github.com/CityU‑MLO/DualOptimPlus.

Authors:Jimmy Dubuisson
Title: Community-Aware Vertex Ordering for Reference-Based Graph Compression: A Cross-Encoder Empirical Study
Abstract:
Reference‑based graph compression encodes each vertex's neighbor list relative to a recent vertex, exploiting locality to compress large directed graphs. The dominant tool, WebGraph's BVGraph, fixes a single encoding pipeline and relies on a separately chosen vertex ordering ‑‑ typically URL‑lexicographic or Layered Label Propagation (LLP). The interaction between ordering and encoder is rarely measured. We propose a two‑stage Leiden+LLP vertex ordering ‑‑ global LLP to seed labels, Leiden community detection, then per‑cluster LLP on each induced subgraph ‑‑ and study how it interacts with reference‑based compression. On graphs with poor initial vertex order, reordering saves 0.3 to 5.4 bits per edge on every dataset and encoder we measured. The size of that gain is largely insensitive to the encoder: on four of five weakly ordered datasets, four independently parameterised encoders agree on the Leiden+LLP‑vs‑plain‑LLP gain within roughly +/‑ 0.04 bpe. On URL‑ordered web crawls, where the distributed ordering already encodes locality, adaptive encoders still benefit from reordering, but encoders tuned to URL‑induced residual structure (BV‑HC, CG at K>1) are mildly hurt by it. To quantify how much encoder choice matters once ordering is fixed, we contribute three reference‑based encoders ‑‑ BG, CS, and CG ‑‑ that perform per‑vertex cost‑optimal selection from up to 28 candidate decompositions. Each is run under its own best‑tested ordering. The best of the three improves over BVGraph high‑compression by 2‑9% on every dataset tested, with the encoder‑level gain consistently smaller than the ordering‑level gain on weakly ordered datasets. The encoder framework also yields a self‑delimiting bitstream that supports low‑overhead random access.

Authors:Brandon Dent
Title: HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine
Abstract:
Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical‑QA benchmarks miss the failure modes that matter in emergency medicine: trajectory‑level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement‑learning environment that rewards trajectory‑level safety under realistic emergency‑medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual‑layer rubric that zeroes reward whenever any safety‑critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety‑critical); a post‑hoc 10‑task negative‑class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5‑28.4] and GPT‑5.4 at 12.6% [10.2‑15.6], with safety‑failure rates of 27.5% and 34.0%. On multi‑step workflows ‑ the closest proxy to real emergency care ‑ performance collapses to near zero (Claude 1.0%, GPT‑5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re‑ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM‑judge overlay bounds evaluator noise, and a 60‑run negative‑class smoke pilot shows the reward signal is not drop‑in training‑safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training‑reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.

Authors:Drake Caraker, Bryan Arnold, David Rhoads
Title: The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity
Abstract:
No feature ranking can be simultaneously faithful, stable, and complete when features are collinear. For collinear pairs, ranking reduces to a coin flip. We prove this impossibility, quantify it for four model classes, resolve it via ensemble averaging (DASH), and machine‑verify it with 305 Lean 4 theorems. We characterize the complete attribution design space: exactly two families of methods exist ‑‑ faithful‑complete methods (unstable, with rankings that flip up to 50% of the time) and ensemble methods like DASH (stable, reporting ties for symmetric features) ‑‑ and no method lies outside this dichotomy. The impossibility is quantitative: the attribution ratio diverges as 1/(1‑rho^2) for gradient boosting, is infinite for Lasso, and converges for random forests. DASH (Diversified Aggregation of SHAP) is provably Pareto‑optimal among unbiased aggregations, achieving the Cramer‑Rao variance bound with a tight ensemble size formula. In a survey of 77 public datasets, 68% exhibit attribution instability. Switching to conditional SHAP does not escape the impossibility when features have equal causal effects. The framework includes practical diagnostics ‑‑ a Z‑test workflow and single‑model screening tool ‑‑ and has direct consequences for fairness auditing: SHAP‑based proxy discrimination audits are provably unreliable under collinearity. The design space theorem, diagnostics, and impossibility are mechanically verified in Lean 4 (305 theorems from 16 axioms, 0 sorry) ‑‑ to our knowledge, the first formally verified impossibility in explainable AI.

Authors:Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng
Title: You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low‑rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank‑1 approximation of the parameter deltas, where the magnitude of this projection evolves near‑linearly with training steps. Motivated by this, we propose a simple and compute‑efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank‑1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5‑Math‑1.5B, Qwen3‑4B‑Base, and Qwen3‑8B‑Base), RELEX produces checkpoints that match or exceed RLVR performance on both in‑domain and out‑of‑domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10‑20× beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non‑linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank‑1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.

Authors:Alim Igilik
Title: Neural Negative Binomial Regression for Weekly Seismicity Forecasting: Per-Cell Dispersion Estimation and Tail Risk Assessment
Abstract:
Standard approaches to forecasting the weekly number of earthquakes on a spatial grid rely on the Poisson distribution with a single global dispersion assumption. We show that this assumption is systematically violated in seismic data from Central Asia (2010‑2024), where a likelihood‑ratio test with boundary correction strongly rejects the Poisson hypothesis (p < 10^‑179). The main contribution of this work is the EarthquakeNet architecture, which provides an endogenous per‑cell estimate of the overdispersion parameter alpha via a neural network (spatial embeddings + MLP), without explicit spatial covariance specification. In contrast to existing negative binomial regression approaches in seismological forecasting, which typically assume a single global alpha, the proposed per‑cell formulation allows the model to identify spatial heterogeneity in seismic clustering and to construct probabilistic risk‑aware alerts via quantiles of the predicted distribution. A walk‑forward evaluation (2018‑2023) over four systems shows an 8.6 percent reduction in mean pinball deviation (MPD) relative to a negative binomial GLM baseline. The strongest improvements are observed in the tail regime (Y >= 5), where the continuous ranked probability score (CRPS) of the proposed model is 12.5 percent lower than that of the baseline, indicating improved calibration in extreme‑event forecasting.

Authors:Elle Miller, Jayaram Reddy, Ayush Deshmukh, Trevor McInroe, David Abel, Oisin Mac Aodha, Sethu Vijayakumar
Title: roto 2.0: The Robot Tactile Olympiad
Abstract:
Tactile‑based reinforcement learning (RL) is currently hindered by fragmented research and a focus on over‑saturated orientation tasks. We introduce v2 of the Robot Tactile Olympiad (\textttroto 2.0), a GPU‑parallelised benchmark designed to standardise tactile‑based RL across four distinct robotic morphologies (16‑DOF to 24‑DOF). Unlike prior benchmarks, roto focuses on end‑to‑end "blind" manipulation, utilising only proprioception and tactile sensing without state information or distillation. We demonstrate a significant performance leap, with our blind agents achieving 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than current state‑of‑the‑art speeds. By open‑sourcing our environments and robustly tuned baselines, we reduce the barrier to entry and enable researchers to prioritise fundamental algorithmic challenges over tedious RL tuning. Website: https://elle‑miller.github.io/roto/

Authors:Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B. Aditya Prakash, Haohan Wang
Title: TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization
Abstract:
Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM‑generated feedback, but the resulting prompts often become longer, accumulate narrow sample‑specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text‑space optimization. We formalize this view through representational inefficiency, a dual‑factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft‑penalty objective through regularized textual gradients, combining Dual‑Evidence Gradient Purification, Semantic Edit Regularization, and Regularization‑Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out‑of‑distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

Authors:Ziqi Wang, Qiang Liu, Nils Thuerey
Title: CRAFT: Conflict-Resolved Aggregation for Federated Training
Abstract:
The aggregation of conflicting client updates remains a fundamental bottleneck in federated learning (FL) under heterogeneous data distributions. Naive averaging can produce a global update that improves the global objective while conflicting with specific clients, causing degradation for those clients. In this work, we propose CRAFT (Conflict‑Resolved Aggregation for Federated Training), a new aggregation framework that treats the global update as a geometric correction problem. We formulate aggregation as finding the update closest to a reference direction while satisfying conflict‑free alignment constraints. We derive a closed‑form expression for the constrained optimization problem, avoiding the computational overhead of iterative solvers. Furthermore, we use a layer‑wise adaptation to address conflicts at varying feature granularities. We provide a theoretical analysis showing that CRAFT promotes a common‑descent structure and mitigates conflicts through its projection geometry. Extensive experiments on heterogeneous benchmarks demonstrate that CRAFT improves the accuracy of the global model while reducing performance disparity across clients compared with state‑of‑the‑art baselines. The source code for CRAFT is available at https://github.com/tum‑pbs/CRAFT.

Authors:Robin Louiset, Edouard Duchesnay, Benoit Dufumier, Antoine Grigis, Pietro Gori
Title: Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls
Abstract:
In biomedical Subgroup Discovery, practitioners are interested in discovering interpretable and homogeneous subgroups within a group of patients. In this paper, assuming that healthy subjects (i.e., controls) share common but irrelevant factors of variation with the patients, we motivate and develop a Contrastive Subgroup Discovery method, entitled Deep UCSL. By contrasting patients with controls, Deep UCSL identifies subgroups driven solely by pathological factors, ignoring common variability shared with healthy subjects. Our framework employs a deep feature extractor to learn a discriminative representation space. Mathematically, we derive a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels, optimized via an Expectation‑Maximization strategy alternating between subgroup inference and feature encoder updates. A regularization term further encourages representations to capture disease‑specific variability while ignoring variability shared with controls. Compared to previous related works, our approach quantitatively improves the quality of the estimated subgroups, as demonstrated on a MNIST example and four distinct real medical imaging datasets. Code and datasets are available at: https://github.com/rlouiset/deep_ucsl.

Authors:Abdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor
Title: Divide and Contrast: Learning Robust Temporal Features without Augmentation
Abstract:
Self‑supervised learning for time‑series representation aims to reduce reliance on labeled data while maintaining strong downstream performance, yet many existing approaches incur high computational costs or rely on assumptions that do not hold across diverse temporal dynamics. In this work, we introduce Divide and Contrast (Di‑COT), an unsupervised framework that avoids data augmentation and multiple encoder passes by contrasting informative substructures within a window rather than individual timesteps. Di‑COT stochastically partitions each window into a small number of overlapping sub‑blocks per iteration, enabling efficient and meaningful contrast while mitigating false positives during temporal transitions. To further improve scalability, we adopt a contrastive objective whose computation depends on the batch size and the number of sub‑blocks, making loss computation independent of sequence length. Extensive experiments on six large‑scale real‑world datasets, as well as the UCR and UEA benchmarks, demonstrate that Di‑COT learns semantically structured and transferable representations, achieving state‑of‑the‑art performance on classification, clustering, kNN, and cross‑dataset transfer, while substantially reducing training time. The source code is publicly available at https://github.com/sfi‑norwai/Di‑COT.

Authors:Yongkang Liu, Zijing Wang, Mengjie Zhao, Ercong Nie, Mingyang Wang, Qian Li, Feiliang Ren, Shi Feng, Daling Wang, Hinrich Schütze
Title: ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning
Abstract:
This work presents \textscChunkFT, a memory‑efficient fine‑tuning framework that reformulates full‑parameter fine‑tuning around a dynamically activated working set. \textscChunkFT enables gradient computation for arbitrary sub‑tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub‑networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of \textscChunkFT in the deterministic setting. Empirically, we apply \textscChunkFT to fine‑tune Llama 3‑8B and Llama 3‑70B using a single RTX 4090‑24GB GPU and 2× H800‑80GB GPUs, respectively. Full‑parameter fine‑tuning of a 7B model with a 1K input length requires only 13.72GB of GPU memory. The results demonstrate the effectiveness of \textscChunkFT in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT‑Bench show that \textscChunkFT consistently outperforms existing memory‑efficient baselines. Notably, \textscChunkFT achieves performance comparable to, and in some cases exceeding, full‑parameter fine‑tuning. Our repository is on https://github.com/misonsky/chunk.

Authors:Xixiang He, Qiyao Sun, Ao Cheng, Xingming Li, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu
Title: Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
Abstract:
Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near‑zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples, guided by real‑time ACR monitoring, to enable learning from homogeneous groups without additional model rollouts. AVSPO reduces advantage collapse by 58‑63% relative to GRPO and yields consistent accuracy gains of 4‑6 percentage points across all model scales, while maintaining generalization on the evaluated out‑of‑domain task. Code and datasets are available at https://qingyonghu.github.io/AVSPO.

Authors:Kesong Li, Yixuan Xu, Kuo-kun Tseng, Weiyi Lu, Kan Liu, Tao Lan
Title: Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
Abstract:
Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text‑to‑image generation. Existing studies are confined to denoising diffusion models while overlooking flow‑matching, and suffer from an objective mismatch when applying discrete NLP‑based DPO to regression‑based generative tasks.\ In this paper, we derive a generalized DPO objective that covers both diffusion and flow‑matching via a unified reverse‑time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text‑to‑image generation. Consequently, we propose Linear‑DPO, which replaces the aggressive sigmoid‑based utility function with a sustained linear utility and incorporates an EMA‑updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and flow‑matching model (SD3‑Medium) demonstrate the superiority of our approach over existing baselines.

Authors:Minh Hoang Nguyen, Dai Do, Huu Hiep Nguyen, Dung Nguyen, Kien Do, Hung Le
Title: Reviving Error Correction in Modern Deep Time-Series Forecasting
Abstract:
Modern deep‑learning models have achieved remarkable success in time‑series forecasting. Yet, their performance degrades in long‑term prediction due to error accumulation in autoregressive inference, where predictions are recursively used as inputs. While classical error correction mechanisms (ECMs) have long been used in statistical methods, their applicability to deep learning models remains limited or ineffective. In this work, we revisit the error accumulation problem in deep time‑series forecasting and investigate the role and necessity of ECMs in this new context. We propose a simple, architecture‑agnostic error correction model that can be integrated with any existing forecaster without requiring retraining. By explicitly decomposing predictions into trend and seasonal components and training the corrector to adjust each separately, we introduce the Universal Error Corrector with Seasonal‑Trend Decomposition (UEC‑STD), which significantly improves correction accuracy and robustness across 4 backbones and 10 datasets. Our findings provide a practical tool for enhancing forecasts while offering new insights into mitigating autoregressive errors in deep time‑series models. Code is available at https://github.com/DA2I2‑SLM/UEC‑STD.

Authors:Jiawen Dai, Yue Song
Title: Winfree Oscillatory Neural Network
Abstract:
Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus (S^1)^d through structured oscillatory interactions, combining phase‑based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze‑hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization‑based oscillatory architecture to scale competitively to ImageNet‑1K. Furthermore, on Maze‑hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state‑of‑the‑art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter‑efficient alternative to conventional neural architectures.

Authors:Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang
Title: DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation
Abstract:
Language‑conditioned manipulation policies typically process instructions and observations through shared network parameters. This task‑state entanglement provides a pathway for observation leakage ‑‑ networks learn scene‑to‑action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task‑specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task‑awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high‑dimensional policy weights is itself a challenging problem. We address it with a two‑stage hypernetwork whose refinement stage embeds the structure of gradient‑based optimization as a feed‑forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO‑90 and Meta‑World, with advantages that widen on complex, long‑horizon tasks ‑‑ and surpasses the large‑scale pretrained π_0 despite using no external pretraining data. On a real‑world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language‑generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few‑shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: https://github.com/ReNginx/DISC.

Authors:Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo
Title: Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
Abstract:
Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF‑optimal policy must prefer human‑preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPOs' guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state‑of‑the‑art performance. Code is available at: https://github.com/visitworld123/CPO.

Authors:Haozhe Jia, Pengyu Yin, Wenshuo Chen, Shaofeng Liang, Lei Wang, Bowen Tian, Xiucheng Wang, Nanqian Jia, Yutao Yue
Title: Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment
Abstract:
Physics‑informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce REPA‑P, a teacher‑free, architecture‑agnostic framework that aligns intermediate features with physical states using first‑principles residuals. REPA‑P attaches lightweight 1×1 projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing zero overhead. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA‑P accelerates convergence by up to 2×, reduces physics residuals by up to 66.4%, and improves out‑of‑distribution robustness by up to 49.3%, with consistent gains on both U‑Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output‑level physics losses. Code is available at [https://github.com/Hxxxz0/REPA‑P](https://github.com/Hxxxz0/REPA‑P).

Authors:Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh
Title: Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
Abstract:
Inference‑time guided sampling steers state‑of‑the‑art diffusion and flow models without fine‑tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre‑trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off‑manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict‑Aware Additive Guidance (g^\textcar), a lightweight and learnable method, which actively rectifies off‑manifold drift by dynamically detecting and resolving gradient conflicts. We validate g^\textcar across diverse domains, ranging from synthetic datasets and image editing to generative decision‑making for planning and control. Our results demonstrate that g^\textcar effectively rectifies off‑manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR‑guidance.

Authors:Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui
Title: The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering
Abstract:
Generative verifiers have emerged as a promising paradigm for step‑wise verification, but their verification behavior is often poorly calibrated: they may be under‑critical and miss erroneous steps, or over‑critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden‑state intervention. We uncover a verification‑specific hidden‑state signal: in step‑wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden‑state steering can directly modulate verifier strictness without fine‑tuning. However, uniform steering induces a trade‑off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample‑level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self‑consistency while requiring 4‑7x less inference compute. VerifySteer is also complementary to verification fine‑tuning, providing further gains on top of fine‑tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.

Authors:Amit Roth, Ankur Samanta, Matan Halevy, Yoav Levine, Yonathan Efroni
Title: Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
Abstract:
Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in TextArena and release Hack‑Verifiable TextArena, a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at https://github.com/MajoRoth/hack‑verifiable‑environments/.

Authors:Ziang Song, Ying Jin, Emmanuel J. Candès
Title: Everywhere Valid Bounds on False Discovery Proportions in Conformal Inference
Abstract:
Modern applications of conformal inference to multiple testing problems, such as outlier detection and candidate selection, often involve selecting test samples whose conformal p‑values fall below a threshold. The quality of such methods is often measured by the false discovery proportion (FDP), defined as the fraction of incorrect selections. Existing approaches typically control the expected value of the FDP, using methods such as the Benjamini‑Hochberg procedure. This approach fails to provide high‑probability bounds on the realized false discovery proportion and invalidates statistical guarantees if the rejection threshold is selected after inspecting the data. This paper establishes finite‑sample, distribution‑free upper bounds on the FDP that hold simultaneously over all possible rejection thresholds, enabling arbitrary post hoc selection of the threshold. Simultaneous validity is achieved by constructing a high‑probability envelope for the empirical distribution function of null conformal p‑values by sampling from their joint distribution. Furthermore, our framework allows practitioners to modulate the envelope's shape, thereby producing tight bounds in rejection regions of primary interest. We use this flexible approach to derive simultaneous FDP upper bounds for both outlier detection and conformal selection. We demonstrate through synthetic and real‑data experiments that the resulting bounds are both valid and substantially less conservative than those derived from existing approaches.

Authors:Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao
Title: AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
Abstract:
Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning‑heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic‑free refinement of GRPO that uses group‑level statistics to control both update magnitude and exploration. AGPO uses a shared probe‑derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust‑region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step‑wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5‑14B trained with AGPO outperforms PPO/GRPO under the same generated‑token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama‑3‑8B and Gemma‑2‑9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.

Authors:Youngjoon Park
Title: Decision-Path Patterns as Tree Reliability Signals: Path-based Adaptive Weighting for Random Forest Classification
Abstract:
Random forests aggregate trees by averaging leaf class distributions with uniform per‑tree weight, which flattens local tree expertise into a globally averaged boundary. To refine this boundary locally, we look for signals in how individual trees navigate the feature space around each sample. We observe that the structural pattern of each tree's root‑to‑leaf decision path ‑‑ where and how often the dominant class label flips along it ‑‑ carries such a signal, conditional on the tree's final decision and the regional context where the sample lies. We propose a class‑conditional ratio weighting that exploits this signal while guaranteeing zero expected class bias by construction, refining the ensemble decision near the boundary without trading one class against another. On 30 binary classification benchmarks (30 repeats), the proposed method yields a statistically significant accuracy improvement over RF (Wilcoxon p = 0.007), while weighted RF and the KNORA family do not reach significance (all p > 0.5). The gain is small (Δacc = +0.0011) but consistent across forest sizes from 100 to 1,000 trees, and regresses neither class (majority 0/30, minority 2/30) ‑‑ unlike KNORA‑Eliminate, which lifts minority recall at the cost of majority regressions on 8/30 datasets.

Authors:Jiawen Zhu, Shuhan Liu, Di Weng, Yingcai Wu
Title: Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting
Abstract:
Non‑stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture‑of‑Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context‑aware expert selection without requiring test‑time updates. Experiments on nine benchmarks demonstrate state‑of‑the‑art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://github.com/andone‑07/Dynamic‑TMoE.

Authors:Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Title: AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals
Abstract:
Self‑distillation enables language models to learn on‑policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token‑level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view‑specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task‑dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive‑View Self‑Distillation), a novel method of self‑distillation with multiple privileged‑information views, which reconstructs token‑level supervision by separating stable cross‑view consensus from view‑specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view‑specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single‑view self‑distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3‑8B and Qwen3‑4B, respectively. Moreover, on code‑generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3‑8B, AVSD outperforms the single‑view self‑distillation baseline by 2.4% on average.

Authors:Junseok Kim, Dohyeong Kim, Mineui Hong, Songhwai Oh
Title: Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning
Abstract:
Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal‑conditioned reinforcement learning (GCRL), where a generalist goal‑reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task‑endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy‑context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: https://rllab‑snu.github.io/projects/CTA/

Authors:Ali Ramlaoui, Alexandre Duval, Hannah Bull, Victor Schmidt, Hugues Talbot, Fragkiskos D. Malliaros, Joseph Musielewicz
Title: TriForces: Augmenting Atomistic GNNs for Transferable Representations
Abstract:
Machine learning interatomic potentials (MLIPs) achieve excellent accuracy when trained on large Density Functional Theory (DFT) data. To be useful in practice, they must often be adapted to target chemistries using small and expensive task‑specific datasets. However, MLIPs transfer inconsistently across domains, with representations that often loose accessible composition and structure information. To address this, we present TriForces, a model‑agnostic three‑stream framework that separates composition and structure information, combined with self‑supervised learning to preserve transferable representations. TriForces improves performance on MatBench and QM9 over baselines without needing DFT labels and enables efficient similar structure retrieval through its learned latent space. On OMat24, in limited‑data training regime, TriForces reduces energy MAE by 57% at 20K samples only and improves force MAE across sample sizes. We release pretrained TriForces variants across multiple MLIP architectures with code at https://github.com/Ramlaoui/triforces.

Authors:Meng Zhu, Quan Xiao, Weidong Min
Title: Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates
Abstract:
Optimization algorithms are core methods by which machine learning models iteratively minimize loss functions, update parameters, learn from data, and improve performance. Momentum SGD and AdamW represent two important optimization paradigms. AdamW produces stable updates and usually has strong robustness across training scenarios, but its generalization performance is sometimes weaker than that of momentum methods. Momentum SGD can often obtain better generalization after careful tuning, but it is more sensitive to gradient‑scale variation and hyperparameter settings. To balance the strengths and weaknesses of the two paradigms, this paper proposes Ada2MS, an optimization algorithm that achieves a smooth transition between AdamW‑like behavior and momentum‑SGD‑like behavior through continuous exponential interpolation between elementwise second‑moment estimates and global second‑moment estimates. On the visual tasks evaluated in this study, Ada2MS obtains competitive results under a unified optimizer‑comparison protocol. The code will be released at https://github.com/mengzhu0308/Ada2MS

Authors:Slim Barkallah, Luke Bailey, Kaiyue Wen, Mohammed Abouzaid, Tengyu Ma
Title: Pseudo-Formalization for Automatic Proof Verification
Abstract:
Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo‑Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo‑Formal proof is decomposed into self‑contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo‑Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research‑level mathematics, where it pareto‑dominates LLM‑as‑judge baselines on error‑finding precision and recall. To support future work, we release our research‑level proof verification benchmark ArxivMathGradingBench.

Authors:Lucky Verma
Title: Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
Abstract:
Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention‑head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss‑landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight‑decay axis separates memorization, developmental grokking, and collapse. A near‑transition logistic fit localizes the memorization‑to‑developmental boundary at λ_c=0.0158 (95% CI [0.0109, 0.0200], N=210); a power‑law fit gives an empirical exponent ν=0.757 (CI [0.725, 0.799]). Reference exponents ν=1/2 and 3D Ising ν\approx 0.63 lie outside this empirical CI under our four‑bin grid, so we report ν as empirical and defer universality‑class identification to denser finite‑size‑scaling work. A horizon‑matched multi‑task replication (n=280, four modular operations) preserves the weight‑decay control pattern; a paired attention‑head re‑initialization experiment at λ=0.05 changes Phase‑2 amplitude (Cohen's d=‑1.190, n=10, p_t=4.5 × 10^‑3), while matched weight‑norm clipping does not. Three cross‑architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight‑decay‑controlled transition with architecture‑specific λ_c values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non‑attention experiments are scope probes, and architecture‑wide, language‑model, and universality‑class claims are out of scope.

Authors:Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang
Title: DEL: Digit Entropy Loss for Numerical Learning of Large Language Models
Abstract:
Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem‑solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty‑driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over‑sharpened and over‑flattened digit distributions, respectively. In this paper, we make an in‑depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion‑distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto‑regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross‑entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer‑based numerical learning to floating‑point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating‑point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen‑2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU‑VCLab/DEL

Authors:Xinlei Liu, Tao Hu, Jichao Xie, Peng Yi, Hailong Ma, Baolin Li
Title: SDM: A Powerful Tool for Evaluating Model Robustness
Abstract:
Gradient‑based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of "high‑loss non‑adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non‑ground‑truth label probability upper bound and the ground‑truth label probability", and proposes a novel and powerful gradient‑based attack method named Sequential Difference Maximization (SDM). SDM establishes a three‑layer optimization framework of "cycle‑stage‑step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage‑wise sequential optimization. Experiments demonstrate that compared with previous state‑of‑the‑art methods, SDM not only achieves stronger attack performance but also exhibits superior cost‑effectiveness. The code is available at https://github.com/X‑L‑Liu/ICML‑SDM.

Authors:Panagiotis Koromilas, Theodoros Giannakopoulos, Mihalis A. Nicolaou, Yannis Panagakis
Title: Neural Collapse by Design: Learning Class Prototypes on the Hypersphere
Abstract:
Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method that contrasts prototypes on the unit hypersphere, and that closing the gap requires fixing each at its point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization's missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL's objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet‑1K, NTCE and NONL surpass CE accuracy, closely approximate NC (\geq 95%), and match CE's converged NC on 4/5 metrics in under 7.5% of its iterations, while SCL with fixed prototypes matches linear probing without the hours‑long classifier training phase. The learned geometry yields +5.5% mean relative improvement in transfer learning, up to +8.7% under severe class imbalance, and improved robustness to corruptions on ImageNet‑C. Our work recasts supervised learning as prototype learning on the hypersphere, with NC reached by design.

Authors:Ziyuan Gao
Title: MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery
Abstract:
Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRP‑CL, a framework that performs online task structure discovery and structure‑aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer‑grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality‑specific LoRA adapters regularized by intra‑modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay‑free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP‑CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6× fewer parameters. Code is available at https://github.com/zygao930/MedCRP‑CL.

Authors:Fatemeh Pesaran zadeh, Seyeon Choi, Xing Han Lù, Siva Reddy, Gunhee Kim
Title: Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
Abstract:
Large language models (LLMs) have enabled web agents that follow natural language goals through multi‑step browser interactions. However, agents fine‑tuned on specific trajectories and domain often struggle to generalize out of domain, and offline training can be compute‑inefficient due to noisy, redundant trajectories and long accessibility‑tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed‑budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm. We further improve efficiency with target‑centered AXTree pruning that keeps only content around the ground‑truth action target, and we mitigate style mismatch for reasoning‑native models by replacing expert traces with model‑generated, style‑consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5‑7B, Gemma3‑4B, and Qwen3‑8B, Weasel improves out‑of‑domain performance while reducing training cost, producing roughly 9.7‑12.5× training speedups over standard fine‑tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.

Authors:Junxi Chen, Junhao Dong, Xiaohua Xie
Title: Adaptive Probe-based Steering for Robust LLM Jailbreaking
Abstract:
Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steering strength adaptively based on contrastive activations' statistics. Experiments demonstrate that our method notably improves the effectiveness and robustness of probe‑based steering, without any extra contrastive prompts or laborious manual tuning. Being an attack paper, this paper focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6% to 70%. Our code is available at https://github.com/fhdnskfbeuv/adaptiveSteering.

Authors:Siyuan Li, Youyuan Zhang, Fangming Liu, Jing Li
Title: Modality-Decoupled Online Recursive Editing
Abstract:
Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for text‑only LLMs often degrade on MLLMs: visually dominant activations skew the statistics that shape updates, causing cross‑modal conflict, while sequential writes become entangled in a shared edit space and amplify long‑horizon interference, causing inter‑edit interference. To address these, we propose M‑ORE, a modality‑decoupled online recursive editor for lifelong MLLM adaptation. M‑ORE is derived from a unified proximal‑projection formulation and admits a closed‑form update with a Sherman‑Morrison recursion, yielding constant per‑edit overhead. It maintains module‑wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low‑rank edit subspace via a Sherman‑Morrison recursion to mitigate long‑horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that our M‑ORE method consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality‑efficiency scaling. Our code is publicly available at https://github.com/lab‑klc/M‑ORE.

Authors:Brown Zaz, Mar Gonzàlez I Català, Ferran Hernandez Caralt, Moshe Eliasof, Pietro Liò
Title: Graph Transductive Sharpening: Leveraging Unlabeled Predictions in Node Classification
Abstract:
In the transductive setting, where the full graph is observed but node labels are only partially available, progress in semi‑supervised node classification has largely focused on architectural innovation. In this paper, we revisit an orthogonal axis: the training objective. We start from a simple observation: transductive models produce predictions for every node during training, including nodes without labels. These unlabeled‑node predictions may contain useful training signal, but standard supervised objectives discard them because no ground‑truth labels are available. Inspired by the decomposition of cross‑entropy into a label‑dependent alignment term and a label‑independent entropy term, we propose prediction confidence as a natural way to extract this signal in the absence of labels. This motivates Transductive Sharpening (TS): a loss‑level modification that minimizes prediction entropy on unlabeled nodes while counterbalancing this effect on labeled nodes. We evaluate Transductive Sharpening across a wide range of node‑classification benchmarks and observe consistent performance improvements without requiring any changes to the backbone architecture. Code is available at https://github.com/transductive‑sharpening/tunedGNN.

Authors:Krati Saxena, Tomohiro Shibata
Title: GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation
Abstract:
Recommending safe and effective medication combinations from electronic health records (EHRs) is a core clinical AI problem, yet it remains difficult because patient trajectories are long, noisy, and clinically heterogeneous. Existing methods typically excel at either temporal modeling across visits or pharmacological knowledge integration (e.g., drug‑drug interactions, DDIs), but rarely achieve both while robustly suppressing noise. We present GraphDiffMed, a knowledge‑constrained medication recommendation framework built on dual‑scale Differential Attention v2. Differential attention is applied at both intra‑visit and inter‑visit levels to filter spurious signals within encounters and across longitudinal history, while pharmacological constraints are incorporated during learning. Experiments on MIMIC‑III and ablation studies show that this design consistently improves recommendation quality and ranking over strong baselines while achieving a more favorable safety performance balance. We further find that the strongest‑performing configuration uses only demographic auxiliary features under our experimental setting. Overall, GraphDiffMed demonstrates that combining noise‑aware attention with pharmacological constraints yields more reliable and clinically meaningful medication recommendation. We open‑source our code at https://github.com/saxenakrati09/GraphDiffMed.

Authors:Emaad Khwaja, Chris Lettieri, Gerald Woo, Eden Belouadah, Marc Cenac, Guillaume Jarry, Enguerrand Paquin, Xunyi Zhao, Viktoriya Zhukov, Othmane Abou-Amal, Chenghao Liu, Ameet Talwalkar, David Asker
Title: Toto 2.0: Time Series Forecasting Enters the Scaling Era
Abstract:
We show that time series foundation models scale: a single training recipe produces reliable forecast‑quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open‑weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT‑Eval, the standard general‑purpose benchmark; and the recent contamination‑resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u‑muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.

Authors:Junjun Pan, Yixin Liu, Yu Zheng, Lianhua Chi, Alan Wee-Chung Liew, Shirui Pan
Title: CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection
Abstract:
Text‑attributed graph fraud detection (TAGFD) plays a critical role in preventing fraudulent activities on online social and e‑commerce platforms. However, to evade detection, fraudsters continuously evolve their camouflaging strategies by deliberately mimicking textual responses of benign users, thereby concealing their malicious purposes. This phenomenon, referred to as semantic camouflage, fundamentally undermines commonly relied assumptions on how structural and attribute cues can be exploited to identify fraudsters, and makes it difficult to spot fraudsters with unsupervised TAGFD. To bridge the gaps, we propose a Case‑Adaptive Multi‑cue Expert fRAmework (CAMERA) for unsupervised TAGFD. CAMERA employs an ego‑decoupled mixture‑of‑experts architecture, where each expert specializes in modeling a distinct type of fraud‑indicative cue. A context‑informed gating model is introduced to jointly consider the ego node representation and its local neighborhood context for adaptive integration of cues learned by different experts. Furthermore, CAMERA leverages the inherent rarity of fraudsters to support unsupervised one‑class learning with expert‑level objectives that encourage modeling dominant benign patterns, thereby enabling reliable unsupervised detection of camouflaged fraudsters. Experiments on 4 challenging datasets show that CAMERA consistently outperforms competitors, showing its effectiveness against semantically camouflaged fraudsters. Code available at https://github.com/CampanulaBells/CAMERA

Authors:He-Yang Xu, Pengyuan Zhang, Zongyuan Ge, Xiaoshuai Hao, Serge Belongie, Xin Geng, Yuxin Peng, Xiu-Shen Wei
Title: Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation
Abstract:
Fine‑grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high‑fidelity spatial perception, and constraint‑respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real‑world deployment. We introduce MetaFine, a diagnostic meta‑evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state‑of‑the‑art vision‑language‑action (VLA) models through this lens exposes severe dimension‑specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder's ability to preserve local spatial structure as a key bottleneck for fine‑grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real‑sim validation, using limited paired real‑world rollouts to calibrate scalable simulation‑based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.

Authors:Bariscan Bozkurt, Efe Ali Gorguner, Francesco Innocenti, Rafal Bogacz
Title: Normative Networks for Source Separation via Local Plasticity and Dendritic Computation
Abstract:
Blind source separation (BSS) is a natural framework for studying how latent causes may be recovered from sensory mixtures, but deriving online and biologically plausible algorithms for structured (i.e., constrained to known domains) and potentially correlated sources remains challenging. Recent work has derived neural networks for BSS from maximization of an entropy measure, yet its online implementations involve complex and nonlocal recurrent dynamics. Motivated by this perspective, we propose Predictive Entropy Maximization, which achieves competitive performance in BSS, using only local weight updates. The method employs a close approximation of an entropy measure, yielding an objective function with easily interpretable components. Minimizing this objective leads to a predictive neural architecture in which feedforward synapses follow an error‑driven rule (that can be realized through dendritic mechanisms), lateral inhibitory connections are learned with local Hebbian plasticity, and source‑domain constraints are enforced through simple output nonlinearities. We derive explicit spectral bounds on the surrogate error, characterizing when the approximation is accurate. Empirically, Predictive Entropy Maximization remains robust under increasing source correlation and observation noise, outperforms biologically plausible algorithms that rely on stronger independence or decorrelation assumptions, and remains competitive with exact determinant‑ and correlative‑information‑based baselines. These results show how local plasticity and adaptive lateral inhibition can emerge from maximizing a regularized second‑order entropy over structured source domains. Our implementation code is available at https://github.com/BariscanBozkurt/Predictive‑Entropy‑Maximization.

Authors:Serhii Zabolotnii
Title: Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation
Abstract:
Uniform sampling on implicitly defined manifolds is a core primitive in motion planning, constrained simulation, and probabilistic machine learning. MASEM addresses this problem by entropy‑maximizing resampling, but its resampling weights depend on a local k‑nearest‑neighbour density estimate whose errors can be amplified by aggressive resampling temperatures. We ask whether a polynomial‑maximization moment estimator can replace the plug‑in density rule without changing the surrounding MASEM architecture. The proposed PMM‑MASEM module computes shell spacings from nested k‑nearest‑neighbour radii, estimates their standardized cumulants, and uses a gated PMM2/PMM3 estimator only when the spacing distribution departs from the flat Exp(1) regime; otherwise it falls back to the plug‑in/MLE rule. This fallback is essential: on a flat homogeneous manifold the plug‑in estimator is already the MLE, so PMM should not outperform it. A local Known‑DGP Monte Carlo experiment confirms this gate: the selector returns MLE on flat Exp(1) spacings and reduces density MSE by 22‑‑36% on asymmetric gamma and boundary‑spacing regimes. The evidence is not uniformly positive: PMM3 worsens a platykurtic uniform spacing law, and a lightweight resampling‑proxy experiment improves seven‑lobes coverage but degrades the sine and swiss‑roll proxies. The current evidence therefore supports an applicability‑boundary result rather than a general MASEM improvement claim.

Authors:Shuo Zhang, Rongqi Hong, Huifeng Zhang, Jian K. Liu
Title: Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding
Abstract:
Predicting protein‑ligand binding affinity remains intractable for multi‑domain proteins, where inter‑domain dynamics govern molecular recognition. Existing geometric deep learning methods typically treat proteins as monolithic static graphs, suffering from rigid‑body assumptions and aleatoric noise in flexible regions. To address this, we introduced HCLBind, a self‑supervised framework that decouples geometric representation learning from affinity regression. HCLBind leverages a general‑to‑specific pre‑training paradigm on the Q‑BioLiP database to learn a robust physical grammar of binding. We propose a novel hierarchical decoy strategy: the model learns local physicochemical constraints through protein coordinate perturbation in single‑domain proteins and global conformational geometry through inter‑domain rotation in multi‑domain complexes. Our hybrid architecture integrates a domain‑gated graph attention network and cross‑modal attention to explicitly prioritize domain interfaces. Furthermore, we employ LoRA on protein and ligand foundation models, ensuring efficient optimization while preserving evolutionary knowledge. Experiments on PDBBind demonstrate that HCLBind effectively learns discriminative interface features and provides robust uncertainty estimation, overcoming the limitations of standard supervised learning. The code is available at https://github.com/jiankliu/HCLBind.

Authors:Hongjiang Chen, Xin Zheng, Pengfei Jiao, Huan Liu, Zhidong Zhao, Huaming Wu, Feng Xia, Shirui Pan
Title: ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability
Abstract:
Temporal graph neural networks (TGNNs) have gained significant traction for solving real‑world temporal graph tasks. However, their interpretability remains limited, as most TGNNs fail to identify which historical interactions most influence a given prediction. Despite promising progress on interpretable TGNNs, existing methods predominantly focus on previously seen historical interactions, which we term stability patterns, while overlooking newly emerging first‑time interactions, which we term transition patterns. Both types of patterns are essential for faithful temporal explanations. To address this limitation, we propose ST‑TGExplainer, a self‑explainable TGNN that disentangles Stability and Transition patterns in temporal graphs for a more faithful Temporal GNN Explainer. Guided by a disentangled information bottleneck objective, ST‑TGExplainer learns a compact explanatory subgraph that remains predictive of the event label while explicitly suppressing label‑conditioned redundancy between stability and transition patterns. Extensive experiments demonstrate that ST‑TGExplainer achieves strong predictive performance and yields more faithful explanations. Code is available at https://github.com/hjchen‑hdu/ST‑TGExplainer.

Authors:Arman Bolatov, Artem Riabinin, Nikita Kornilov, Andrey Veprikov, Samuel Horváth, Martin Takáč, Aleksandr Beznosikov
Title: LionMuon: Alternating Spectral and Sign Descent for Efficient Training
Abstract:
In large‑scale optimization, the cheapness and effectiveness of update steps are the most crucial factors for a successful optimizer. Sign‑based optimizers like Lion or Signum produce cheap per‑step updates, whereas Muon's spectral matrix‑sign update gives a much stronger direction at a substantially higher per‑step cost. In this work, we propose LionMuon, which retains the effectiveness of Muon steps while considerably cutting the averaged iteration cost, similar to sign‑based methods. It alternates between Lion's and Muon's updates on a fixed period P, sharing a single dual‑EMA momentum buffer between them. The optimizer state memory therefore matches Lion and is exactly half of AdamW's. A simpler single‑EMA variant, SignMuon, by itself already outperforms pure Muon. At P = 2, LionMuon Pareto‑dominates Muon, Lion, Signum, and AdamW on every dataset and architecture we tested at 124M model size, reaching lower validation loss at lower compute, and the same advantage persists at 355M and 720M scale. On the theory side, we prove sharp complexity bounds under heavy‑tailed noise which are governed by period‑averaged smoothness and noise that interpolate between Muon's and Lion's constants. These bounds predict the compute‑optimal period and the conditions under which LionMuon outruns Muon and Lion. Code: https://github.com/brain‑lab‑research/lion‑muon

Authors:Zinuo You, Jin Zheng, John Cartlidge
Title: Latent Laplace Diffusion for Irregular Multivariate Time Series
Abstract:
Irregular multivariate time series impose a trade‑off for long‑horizon forecasting: discrete methods can distort temporal structure via re‑gridding, while continuous‑time models often require sequential solvers prone to drift. To bridge this gap, we present Latent Laplace Diffusion (LLapDiff), a generative framework that models the target as a low‑dimensional latent trajectory, enabling horizon‑wide generation without step‑by‑step integration over physical time. We guide the reverse process utilizing a stable modal parameterization motivated by stochastic port‑Hamiltonian dynamics, and parameterize its mean evolution in the Laplace domain via learnable complex‑conjugate poles, enabling direct evaluation over irregular timestamps. We also link continuous dynamics to irregular observations through renewal‑averaging analysis, which maps sampling gaps to effective event‑domain poles and motivates a gap‑aware history summarizer. Extensive experiments show that LLapDiff improves over baselines in long‑horizon forecasting, and its continuous‑time generative nature supports missing‑value imputation by querying the same model at historical timestamps. Code is available at https://github.com/pixelhero98/LLapDiffusion.

Authors:Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler
Title: Stitched Value Model for Diffusion Alignment
Abstract:
For practical use, diffusion‑ or flow‑based generative models must be aligned with task‑specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie‑style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel‑space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel‑space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT‑L and SD 3.5 Medium takes only 10 GPU‑hours. By lifting powerful pixel‑space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per‑sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post‑training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.

Authors:Hao Li, Lifu Du, Nurul Hameed, Shemonti Saha Authai, Zlata Stefanovic, Chenjie Xu
Title: Agentic Discovery of Cryomicroneedle Formulations
Abstract:
Cryomicroneedles offer a route to minimally invasive intradermal delivery of living cells, but their cryogenic formulations must reconcile cell protection with constraints on toxicity and device fabrication. Here we report an AI‑assisted, closed‑loop workflow for cryomicroneedle cryoprotectant discovery that combines literature curation, Gaussian‑process surrogate modelling, Bayesian optimization, and sequential wet‑lab validation. A curated dataset of 198 mesenchymal stem‑cell cryopreservation formulations from 42 studies was converted into 21 ingredient features and used to train an uncertainty‑aware literature prior. This model captured moderate structure in the literature data but failed prospectively, motivating iterative wet‑lab correction. Across ten validation iterations and 106 wet‑lab observations, the model progressively adapted to cryomicroneedle‑specific outcomes: batch RMSE decreased from 41.21 to 6.86 percentage points, later‑stage rank correlations became consistently positive, and the cumulative wet‑lab predicted‑versus‑measured summary reached R^2 = 0.942. The best validated formulation achieved 95.15% post‑thaw viability with low DMSO, ectoin, ethylene glycol, and fetal bovine serum. However, high viability alone did not ensure intact cryomicroneedle formation, highlighting the need for future multi‑objective optimization. These results demonstrate that agent‑assisted computational infrastructure can make data‑efficient formulation discovery more accessible to labs with minimal data expertise in‑house. Project code is available at https://github.com/baitmeister/ML‑for‑CryoMN.

Authors:Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong
Title: OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
Abstract:
The rapid advancement toward long‑context reasoning and multi‑modal intelligence has made the memory footprint of the Key‑Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per‑channel quantization effectively accommodates intrinsic channel‑wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per‑channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni‑Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X‑LLMs (i.e., text‑only, multi‑modal, and omni‑modal LLMs). Advancing the per‑channel paradigm, OScaR employs Canalized Rotation followed by Omni‑Token Scaling to mitigate TNI‑induced sequence‑dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X‑LLMs show that OScaR consistently outperforms existing methods and achieves near‑lossless performance under INT2 quantization, establishing it as a robust, low‑complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding‑v2 baseline, our OScaR implementation achieves a notable up to 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR‑KV‑Quant.

Authors:Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia
Title: optimize_anything: A Universal API for Optimizing any Text Parameter
Abstract:
Can a single LLM‑based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI‑based optimization system‑supporting single‑task search, multi‑task search with cross‑problem transfer, and generalization to unseen inputs‑achieves state‑of‑the‑art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC‑AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score‑only feedback, and that multi‑task search outperforms independent optimization given equivalent per‑problem budget through cross‑task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM‑based search is a general‑purpose problem‑solving paradigm, unifying tasks traditionally requiring domain‑specific algorithms under a single framework. We open‑source optimize\_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa‑ai/gepa .

Authors:Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi
Title: Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution
Abstract:
Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight‑line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients, which constructs integration paths based on singular value decomposition (SVD) of the baseline‑to‑input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine‑grained details, naturally following a coarse‑to‑fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path‑based attribution methods. Our code is available at https://github.com/leekwoon/sig/.

Authors:Jianan Ma, Jingyi Wang, Qi Xuan, Zhen Wang
Title: Provable Fairness Repair for Deep Neural Networks
Abstract:
Deep neural networks (DNNs) are suffering from ethical issues such as individual discrimination. In response, extensive NN repair techniques have been developed to adjust models and mitigate such undesired behaviors. However, existing fairness repair methods are typically data‑centric, which often lack provable guarantees and generalization to unseen samples. To overcome these limitations, we propose ProF, a novel fairness repair framework with provable guarantees. The key intuition of ProF is to leverage interval bound propagation (a widely used NN verification technique) to soundly capture model outputs over the whole set S(\mathbfx) around a biased sample \mathbfx. The derived bounds are utilized to guide fairness repair which encourages the model to produce consistent outputs on S(\mathbfx). Specifically, we integrate fairness constraints and model modifications into a unified constraint‑solving formulation, which can be transformed to a Mixed‑Integer Linear Programming (MILP) problem solvable by off‑the‑shelf solvers. The solution to the MILP problem effectively induces a repaired model with guaranteed fairness over the whole set S(\mathbfx). We evaluate ProF on four widely used benchmark datasets and demonstrate that it achieves provable fairness repair, with generalization of up to 95.93% on full datasets and 93.16% on the entire input space. Notably, ProF can be easily configured to support multiple sensitive attributes and more practical fairness definitions, while providing provable repair guarantees and delivering around 90% fairness improvement. Our code is available at https://github.com/nninjn/ProF.

Authors:Carlo Romeo, Andrew D. Bagdanov
Title: ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
Abstract:
Reinforcement learning for legged locomotion has matured into a stack of multi‑component reward functions and physics‑engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim‑to‑real robotics and routinely take the form of creatures with no real‑robot counterpart. We introduce ARC‑RL, a suite of four MuJoCo continuous‑control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18‑DoF tall hexapod Queen, the 12‑DoF armoured hexapod Bastion, the 18‑DoF compact hexapod Tick, and the 12‑DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed‑form multi‑component reward function whose only per‑morphology variation lives in a small set of weights and parameters. The reward fuses a velocity‑tracking tent, a healthy survive bonus, a phase‑locked gait‑compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion‑capture data enters the reward at any point. We additionally provide hand‑crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline‑to‑online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE‑EO) and methods augmented with prior data (SACfD, SPEQ‑O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation‑style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.

Authors:Daisuke Oba, Hiroki Furuta, Naoaki Okazaki
Title: Drifting Objectives for Refining Discrete Diffusion Language Models
Abstract:
Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling‑time correction can instead be absorbed into training through an anti‑symmetric fixed‑point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non‑differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft‑token features, applies anti‑symmetric drifting in a frozen semantic space, and backpropagates the resulting stop‑gradient feature target to DDLM logits. In controlled continual‑training experiments with masked and uniform‑state diffusion backbones, TokenDrift improves fixed‑NFE generation quality over matched continuation baselines, reducing Gen.‑PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.

Authors:Noam Major, Kathy Razmadze, Yoli Shavit
Title: Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models
Abstract:
The success of self‑supervised learning (SSL) in vision and NLP has motivated its rapid adoption for time series. However, research has focused primarily on Generative paradigms and forecasting tasks, leaving the broader utility of learned representations unquantified. We establish a controlled framework to evaluate the "pre‑training dividend": the value added by SSL across diverse temporal tasks. We systematically compare Generative paradigms against Latent Alignment architectures, introducing adaptations of LeJEPA and DINO for time series. These adaptations utilize Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Our analysis reveals that the pre‑training dividend is highly asymmetric: SSL yields gains of up to 375% for anomaly detection and classification, yet remains marginal for forecasting. We demonstrate that representational utility is non‑universal, governed by a precision‑invariance trade‑off where the specific signal resolution required by the task must align with the objective. Finally, we show that representation quality is largely independent of data origin and saturates at moderate architectural depths, suggesting a path to scaling via massive synthetic generation. Our code is available at: https://github.com/noammajor/Models

Authors:Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang
Title: When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window
Abstract:
Test‑time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo‑label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already‑solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per‑problem tracking reveals that correct‑answer signals in low‑ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct‑Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL‑Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip‑Rate‑Aware Reward Scaling (FRS) down‑weights at‑risk updates as FR declines, Minority‑Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk‑Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL‑Guard achieves the best average pass@1 on Qwen2.5‑7B‑Instruct and Qwen3‑4B, improves relatively over TTRL by +54% on AIME 2025. \footnoteOur code and implementation details are available at https://github.com/linhxkkkk/TTRL‑Guard.

Authors:Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan
Title: CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
Abstract:
When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong‑answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution‑matching self‑distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.

Authors:Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin
Title: Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings
Abstract:
Vision‑language alignment using chest X‑rays and radiology reports has emerged as an advanced paradigm for zero‑shot classification and grounding of chest X‑ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero‑shot understanding tasks. To address this challenge, we propose CoNNS, a concept‑guided noisy‑negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross‑patient pair relabeling strategy comprising three steps: (1) Fine‑Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept‑Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi‑granularity zero‑shot grounding tasks and five zero‑shot classification datasets validate that CoNNS outperforms existing state‑of‑the‑art models. The code is available at https://github.com/DopamineLcy/conns.

Authors:Halil Ibrahim Gulluk, Olivier Gevaert
Title: MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
Abstract:
Deep learning methods have demonstrated promising results in predicting BI‑RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi‑modal model that uses a pretrained PubMedBERT as the language component. By training this model on image‑text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine‑tune the vision encoder on two datasets for BI‑RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3‑class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image‑text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision‑language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre‑trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM‑CLIP

Authors:Soojin Choi, Seokhyeon Hong, Chaelin Kim, Junghyun Nam, Junhyuk Jeon, Junyong Noh
Title: Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance
Abstract:
Retargeting motion across characters with varying body shapes while preserving interaction semantics, such as self‑contact and near‑body proximity, remains a challenging problem. While recent geometry‑aware approaches address this by maintaining spatial relationships between predefined corresponding regions, their reliance on static correspondences often struggles when the target character exhibits exaggerated body proportions. In this paper, we present a geometry‑aware motion retargeting framework that preserves interaction semantics by performing proximity matching over spatially adaptive anchors. Unlike prior methods with static anchor definitions, the proposed method dynamically repositions anchors to reachable regions on the target character. This is achieved via a Transformer‑based anchor refinement strategy that predicts anchor displacements and constrains the translated anchors to remain on the target character geometry through differentiable soft projection. By incorporating pose‑dependent spatial structures from the source character, the adapted anchors provide structurally coherent guidance for interaction‑aware retargeting. Conditioned on these anchors, a graph‑based autoencoder predicts target skeletal motion that preserves the spatial configuration of the source. To encourage task‑aligned optimization between anchor adaptation and motion retargeting, we adopt an alternating training scheme in which each module is optimized in turn. Through extensive evaluations, we demonstrate that our method outperforms state‑of‑the‑art approaches in preserving interaction fidelity across diverse character geometries.

Authors:Joy Bose
Title: IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis
Abstract:
We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata‑derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.

Authors:Qiujing Lu, Tonmoy Monsoor, Ehsan Ebrahimzadeh, Kartik Sharma, Vwani Roychowdhury
Title: An Exterior Method for Nonnegative Matrix Factorization
Abstract:
Nonnegative matrix factorization (NMF) seeks a low‑rank approximation X \approx UV^T with nonnegative factors and is commonly solved using interior methods that enforce feasibility throughout optimization. We show that such constraint‑driven approaches can impede progress in the nonconvex landscape, leading to slow convergence or convergence to suboptimal stationary points. We propose an exterior framework for NMF (eNMF) that separates low‑rank approximation from nonnegativity enforcement. Our method initializes from the optimal unconstrained factorization and introduces a rotation procedure that maps unconstrained factors to an exterior point closest to the nonnegative orthant. This viewpoint yields an algorithmic framework in which simple iterative updates converge to KKT‑satisfying stationary points on the boundary of the positive orthant. The exterior formulation also enables a geometric interpretation of NMF solutions, clarifying equivalence classes of factorizations under permutation and orthogonal transformations. An intriguing numerical result, involving 400 NMF experiments across both real and synthetic datasets, show that in 99% of the cases, different algorithms tend to converge towards equivalent factor matrices. We benchmark eNMF against 9 state‑of‑the‑art NMF algorithms with 9 initialization schemes across 3 real‑world and 2 synthetic datasets. eNMF consistently outperforms all 81 competitors, achieving up to 30% lower reconstruction error under equal‑time settings and up to 150% speedup under equal‑error settings. The downstream experiments further demonstrate substantial performance gains in audio processing and recommendation tasks, corroborating the practical benefits of the proposed exterior optimization framework. Code is available at https://github.com/roychowdhuryresearch/eNMF

Authors:Taegu Kang, Jaesik Yoon, Sungjin Ahn
Title: Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement
Abstract:
Inference‑time scaling has emerged as a major approach for improving reasoning capabilities, and has been increasingly applied to diffusion models. However, existing inference‑time scaling methods for diffusion models typically rely on external verifiers or reward models to rank and select samples, limiting their scalability to settings where such evaluators are available and reliable. Moreover, while recent diffusion models perform sequential inference with region‑wise, mixed‑noise conditioning, inference‑time scaling tailored to this setting remains relatively underexplored. We propose Iterative Partial Refinement (IPR), an inference‑time scaling method for sequential diffusion that requires no external verifier. Starting from an already‑generated sample, IPR re‑noises a subset of regions and regenerates them conditioned on the remaining regions, enabling the model to revise earlier decisions under a richer context than was available during the initial generation. This iterative partial refinement produces more globally consistent samples without external verification. On reasoning tasks requiring global constraint satisfaction, IPR consistently improves performance: on MNIST Sudoku, the valid solution rate increases from 55.8% to 75.0%. These results show that iterative partial refinement alone can serve as an effective inference‑time scaling strategy for diffusion models in sequential, mixed‑noise settings. Code is available at: https://github.com/ahn‑ml/IPR

Authors:Thomas Vincent Howe, David Wingate
Title: Language models struggle with compartmentalization
Abstract:
In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.

Authors:Omer Haq
Title: EviTrack: Selection over Sampling for Delayed Disambiguation
Abstract:
Sequential prediction is challenging in regimes of delayed disambiguation, where early observations are ambiguous and multiple latent explanations remain plausible until sufficient evidence accumulates. Standard approaches based on marginal inference struggle in this setting, either collapsing uncertainty prematurely or failing to recover once informative evidence arrives. We introduce EviTrack, a test‑time inference framework that operates over latent trajectories rather than marginal states. EviTrack maintains a set of competing trajectory hypotheses and applies evidence‑ and likelihood‑ratio‑based selection to delay commitment until supported by data, drawing inspiration from hypothesis management in multiple hypothesis tracking and track‑before‑detect. To evaluate this setting, we construct a controlled synthetic benchmark with known latent ground truth that explicitly exhibits delayed disambiguation. At matched inference budget, EviTrack substantially outperforms sampling‑based baselines, achieving faster post‑disambiguation recovery. These results show that, in delayed disambiguation regimes, moderate trajectory‑level selection is more effective than increasing sampling coverage, highlighting selection over sampling as a key principle for reliable sequential inference.

Authors:Jianan Yang, Yiran Wang, Shuai Li, Fujun Cao, Xuefei Yan, Junmin Liu
Title: From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models
Abstract:
Physics‑informed neural networks (PINNs) offer a mesh‑free framework for solving partial differential equations (PDEs), yet training often suffers from gradient pathologies, spectral bias, and poor convergence, especially for problems with strong nonlinearity, sharp gradients, or multiscale features. We propose the Curriculum‑Guided Gaussian Mixture Physics‑Informed Neural Network (CGMPINN), which integrates Gaussian mixture modeling with dynamic curriculum learning. Specifically, a GMM is periodically fitted to the PDE residual distribution to quantify spatially varying learning difficulty. A smooth curriculum schedule progressively shifts training focus from easy to harder regions, while precision‑based variance modulation suppresses unreliable clusters during early optimization. This dual curriculum is governed by a shared curriculum parameter and can be combined with self‑adaptive loss balancing. We further establish theoretical guarantees, including sublinear convergence of the gradient norm for the induced time‑varying loss, uniform equivalence between the curriculum‑weighted and standard PDE losses, and a generalization bound with an explicit weighting‑induced bias characterization. Experiments on six benchmark PDEs spanning elliptic, parabolic, hyperbolic, advection‑dominated, and nonlinear reaction‑diffusion types show that CGMPINN consistently achieves the lowest relative L_2 and maximum absolute errors among all compared methods, reducing relative L_2 error by up to 97.8% over the standard PINN at comparable cost. Our code is publicly available at https://github.com/Mathematics‑Yang/CGMPINN.

Authors:Carlos A. Durán Paredes, Javier E. León Calderón, Nicolás Sánchez Perea, German Darío Díaz, Camilo Segura Quintero
Title: Quantum Machine Learning for Cyber-Physical Anomaly Detection in Unmanned Aerial Vehicles: A Leakage-Free Evaluation with Proxy-Audited Feature Sets
Abstract:
Unmanned aerial vehicles (UAVs) are cyber‑physical systems whose attack surface spans networked avionics and on‑board sensor fusion: a compromised GPS or battery module can mimic a benign mission segment and evade naive anomaly detectors. We present a leakage‑free evaluation of quantum machine learning for UAV anomaly detection on the multi‑sensor TLM:UAV benchmark. Three contributions support the study. (i) A group‑aware temporal protocol (B2) partitions the dataset into ten contiguous TimeUS blocks and evaluates over ten seeds, eliminating the inflation produced by random stratified splits that mix neighbouring samples. (ii) A three‑mode feature audit (full/loose/strict) quantifies how much accuracy stems from instantaneous physical signals versus contextual proxies (cumulative energy, battery state, GPS trajectory). (iii) A hybrid XGBoost + Data Reuploading (DRU) classifier is benchmarked against five paired non‑linear controls (raw, PCA, polynomial‑2, random‑RBF, and an untrained DRU map) under identical budgets. The standalone DRU does not consistently match the strongest classical baseline across seeds; however, the trained‑DRU hybrid is the only model whose mean F1 macro shifts upward from full to strict (+0.05), a directional signal that the per‑seed standard deviations prevent from being interpreted as a statistically established difference. The trained‑DRU hybrid also records the lowest mean false‑alarm rate under proxy‑free evaluation, subject to the inter‑seed variance reported. We frame this as an incremental, reproducible quantum‑enhanced hybrid benefit, and provide an open Qiskit 2.x implementation as a benchmark for cybersecurity analytics in NISQ‑era aerospace systems.

Authors:Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee
Title: RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
Abstract:
Supervised open‑loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi‑agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement‑learning‑based fine‑tuning framework that enhances scenario realism by aligning simulator rollouts with real‑world data distributions and provides a method for distilling goal‑conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre‑trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state‑of‑the‑art performance. Compared with other heuristic search‑based fine‑tuning methods, RLFTSim requires significantly fewer samples due to a proposed low‑variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan‑ami.github.io/rlftsim.

Authors:Ali Zindari, Xiaowen Jiang, Rotem Mulayoff, Sebastian U. Stich
Title: Learning When to Adapt
Abstract:
Low‑rank adaptation (LoRA) is a widely used parameter‑efficient fine‑tuning method, yet its learned correction is static: the same low‑rank update is applied to every input. This input‑agnostic approach creates an inevitable compromise between adapting to the fine‑tuning distribution and preserving pre‑trained behavior on inputs outside that distribution, contributing to catastrophic forgetting. We introduce DISeL (Dynamic Input‑Sensitive LoRA), which augments LoRA modules with lightweight input‑dependent gates over individual rank‑one components. The gating mechanism is designed to preserve the pre‑trained model's behavior by default, while training learns to activate selected components that reduce the fine‑tuning loss. DISeL adds only a small number of parameters and preserves the low‑rank structure. Across RoBERTa on GLUE, and Llama and Mistral models fine‑tuned for mathematical reasoning and code generation, DISeL reduces forgetting relative to LoRA and related variants while maintaining competitive fine‑tuning accuracy. In addition, the learned gate activations provide an interpretable diagnostic view of which layers and rank components are most activated during fine‑tuning, giving insight into where task‑specific adaptation is concentrated. Code available at https://github.com/alizindari/DISeL .

Authors:Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov, Hafize Gonca Cömert
Title: SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction
Abstract:
Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only first and second moments of the conditional distribution and miss long‑range nonlinear structure. We propose SAGA, a decoder‑only transformer for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual‑level prediction intervals with finite‑sample marginal coverage guarantees. Trained on the longitudinal Swedish LISA register over 1990 to 2022, comprising 2,143,817 individuals and 61,284,903 person‑years, the model forecasts annual labor earnings at horizons of one to thirty years and aggregates them by Monte Carlo into present‑discounted lifetime earnings distributions. Against the canonical Guvenen, Karahan, Ozkan, and Song parametric process and tabular and recurrent baselines, SAGA reduces continuous ranked probability score by 31.9 percent at the ten‑year horizon and mean absolute error by 37.7 percent at the twenty‑year horizon. Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst‑case demographic subgroup. The reconstructed lifetime earnings Gini coefficient is 0.327 against the partially observed truth of 0.341 and the GKOS estimate of 0.378. Model weights, calibration tables, and a synthetic equivalent dataset are released for replication outside the protected SCB MONA environment.

Authors:Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian, Junfeng Jiao, Christian Claudel
Title: EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction
Abstract:
Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real‑world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real‑world urban environments. Each recording provides synchronized RGB video along with ground‑truth data, including continuous time‑synchronized 6‑degree‑of‑freedom head poses, per‑frame 3D eye gaze vectors, scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long‑horizon, self‑directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state‑of‑the‑art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR‑based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

Authors:Tiexin Ding
Title: A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
Abstract:
We apply the Weibull distribution ‑‑ a two‑parameter family from extreme‑value theory ‑‑ as a diagnostic framework for element‑wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle‑80% probability‑plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture‑independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per‑component, per‑layer, and per‑step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo‑1/2, LLaMA‑3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o ‑‑ the Transmission Class ‑‑ fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross‑family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre‑LN/QK‑Norm placements, and 70M‑14B sizes. Second, the attention input projections W_q, W_k ‑‑ the Selection Class ‑‑ depart from the Weibull family, with severity shaped by storage: separately‑stored Q/K (OLMo‑1, OLMo‑2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia's merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm‑weibull‑py v0.4 (Python library) and DATABASE_v9_1 at https://github.com/tiexinding/NPM‑Weibull‑public .

Authors:Wei Shi, Ziheng Peng, Sihang Li, Xiting Wang, Xiang Wang, Mengnan Du, Na Zou
Title: To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents
Abstract:
LLM agents exhibit a consistent tendency to over‑call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no‑call accuracy, leaving overall accuracy in the 55%‑70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no‑call decision mapping carries an activation‑independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior‑aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision‑neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin‑Calibrated Steering (AMCS), a closed‑form counter‑bias shift along SAE decoder directions. Cancelling the diagnosed offset mitigates over‑calling and improves overall accuracy with a negligible drop in call accuracy. Our work recasts over‑calling from an empirical phenomenon into a mechanistic object amenable to causal correction. Code is available at https://github.com/SKURA502/agent‑sae/.

Authors:Yujie Lin, Chengyi Yang, Zhishang Xiang, Yiping Song, Jinsong Su
Title: ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models
Abstract:
Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine‑tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re‑mapping problem via model editing. We propose ZeroUnlearn, a few‑shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed‑form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient‑based variant for multi‑sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available at the github: https://github.com/XMUDeepLIT/ZeroUnlearn.

Authors:Chanuk Lee, Minki Kang, Sung Ju Hwang
Title: SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
Abstract:
Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse‑KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward‑KL provides a satisfactory solution, as both disrupt the efficiency‑coverage trade‑off by either inducing reward hacking or allocating probability mass to off‑target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse‑KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.

Authors:Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao, Anjie Yang, Liang Tian, Lynn Ai, Eric Yang, Tianyu Shi
Title: TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
Abstract:
LLM routing matters most in long‑horizon applications such as coding agents, deep research systems, and computer‑use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one‑shot prompts. They never expose the router‑visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step‑level routing benchmark with two tracks. The static track provides 970 router‑visible prefixes from 520 instances across SWE‑bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution‑verified target tier estimated under a released downgrade‑and‑cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator‑side LLM judge. The dynamic track supplies a harness that runs routers on the full 500‑case SWE‑bench Verified suite; in this paper we report a 100‑case held‑out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end‑to‑end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.

Authors:Vyzantinos Repantis, Harshvardhan Singh, Tony Joseph, Cien Zhang, Akash Vishwakarma, Svetlana Karslioglu, Michael Wyatt Thot, Ameya Gawde
Title: The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection
Abstract:
For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits‑over‑Random (BoR), a chance‑corrected measure of retrieval selectivity that reveals when high success rates mask random‑level performance. We measure selectivity as BoR = \log_2\left(\frac\mathrmP_obs\mathrmP_rand\right), where \mathrmP_rand is the hypergeometric baseline for the chosen success rule (here, coverage: \geq1 relevant in top‑K). On the 20 Newsgroups dataset, BM25 and SPLADE both report >99% success at K=100 (coverage), yet BoR \approx 0, indicating random‑level selectivity at that depth. When the expected coverage ratio \left(\fracK \cdot \barR_qN\right) exceeds 3‑5, the baseline dominates and selectivity collapses. Downstream retrieval‑augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at K=100, consistent with the near‑zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13‑point recall gap), confirming baseline predictions across sparse and large‑scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.

Authors:Cheng Luo, Zefan Cai, Junjie Hu
Title: Delta Attention Residuals
Abstract:
Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross‑layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low‑contrast and closer to uniform (max weight \approx0.2), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer‑wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas ‑‑ the change introduced by each sublayer (\mathbfv_i = \mathbfh_i+1 ‑ \mathbfh_i) ‑‑ instead of cumulative states. Delta representations are structurally diverse and yield higher‑contrast attention distributions (max weight \approx0.6), enabling more selective and effective routing across layers. This principle applies at both per‑sublayer and block granularity. Across all tested scales (220M‑‑7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with 1.7‑‑8.2% validation perplexity gains. Delta Attention Residuals also enables converting pretrained checkpoints into Delta Attention Residuals via standard fine‑tuning. Code is available at https://github.com/wdlctc/delta‑attention‑residuals‑code.

Authors:Adil Amin
Title: The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
Abstract:
Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases ‑‑ and at the frontier, this interaction is the more informative signal. We decompose paired SWE‑bench and GPQA Diamond scores into a population coupling trend and per‑release residual (h‑field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024‑‑2026), capabilities cooperate (r = +0.72, p < 10^‑6), but cooperation varies by lab and over time: DeepSeek reversed from reasoning‑rich to coding‑first (h: +11.2 \to ‑4.7, 15.9‑pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static ‑‑ it cascades. Six open‑weight architectures confirm a second capability transition at 30‑‑72B, and SWE‑bench is now saturating while HLE and instruction‑following retain discriminatory spread ‑‑ signaling the next axis rotation. We provide a three‑level playbook (locate, diagnose, rotate), a per‑lab measurement‑priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases. Per‑lab coupling slopes vary 5× (Google 1.15 vs. DeepSeek 0.23), quantifying how efficiently each recipe converts coding gains into reasoning. Five April 2026 releases confirm the diagnostic out of sample (r rises from +0.72 to +0.75). An interactive dashboard provides phase classification with actionable recommendations, h‑field diagnostics, per‑lab coupling trajectories, ODE‑based scaling predictions, benchmark rotation guidance, self‑steering demo, and live tracking of all seven predictions: https://zehenlabs.com/cape/.

Authors:Adil Amin
Title: Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling
Abstract:
Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family‑dependent critical scale N_c, capabilities anticorrelate; above it, they cooperate. N_c \approx 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 \to 0.830 at matched scale), Gemma‑4 at 4B achieves coupling 0.871, characteristic of 13B+ standard‑trained models, through distillation and architectural innovation, and Phi at 1B matches web‑trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output‑projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse‑regression ODE cross‑predicts held‑out Llama‑2 at 5.6% error. The diagnostic requires no model internals ‑‑ only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). Code, data, and an open‑source activation‑steering tool for any open‑weight model are released alongside an interactive dashboard that diagnoses any model's coupling phase, suggests concrete interventions (data curation, width, benchmark rotation), and provides ODE scaling predictions, frontier diagnostics, and eigenstructure analysis: https://zehenlabs.com/cape/.

Authors:Yuanqing Wang, Yuchen Zhang, Hao Lin, Junhao Hu, Chunyang Zhu, Quanlu Zhang, Boxun Li, Guohao Dai, Zhi Yang, Daning Cheng, Yunquan Zhang, Yu Wang
Title: DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training
Abstract:
Modern large language model (LLM) training is inherently dynamic: resource fluctuations, RLHF phase shifts, and cluster elasticity continually reshape the optimal parallelism layout, posing a significant challenge to existing training frameworks built around a static execution model. We present DynaTrain, a distributed training system for sub‑second, online reconfiguration across arbitrary multi‑dimensional parallelism. At its core, we propose a Virtual Parameter Space (VPS) abstraction that unifies all distributed training states under one logical coordinate space, turning any parallelism configuration into a deterministic mapping and collapsing complex transition into manageable geometric intersections. On top of VPS, a state routing‑and‑transition layer executes rank‑local transfers under a memory‑aware, deadlock‑free schedule, and an Elastic Device Manager overlaps new‑world construction with ongoing training to mask topology‑change cost. On dense and MoE models up to 235B parameters, DynaTrain reconfigures a 70B dense model in under 2s and a 235B MoE model in 4.36s, outperforming state‑of‑the‑art checkpoint‑based and elastic systems by up to three orders of magnitude while preserving correctness.

Authors:Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai
Title: ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning
Abstract:
Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter‑turn correctness‑transition problem rather than a final‑answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition‑aware reinforcement learning framework that decomposes Initial‑to‑Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail‑adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5‑4B and from 45.40 to 55.59 on Qwen3.5‑9B. Ablations show that final‑answer rewards provide little interaction‑level gain, while transition‑aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic‑stage improvement. The code is available at https://github.com/black‑yt/ReCrit .

Authors:Taiki Miyagawa, Akinori F. Ebihara
Title: Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis
Abstract:
We propose non‑parametric estimators for the average run length (ARL) and average detection delay (ADD) in quickest changepoint detection (QCD) under finite and irregular sequence lengths. Although ARL and ADD are widely used as optimality criteria in theoretical and simulation studies, their application to real‑world datasets is hindered by limited and irregular sequence lengths. To address this issue, we propose non‑parametric estimators for the ARL and ADD, termed KM‑ARL and KM‑ADD, by drawing an analogy between QCD and survival analysis to model detection probabilities under sequence truncation. We derive estimation bias bounds and prove that they are asymptotically unbiased unless extrapolation is required. Experiments on simulated and real‑world datasets demonstrate their practical utility, enhancing robustness against limited and irregular sequence lengths, improving interpretability, and facilitating empirical, intuitive model selection. Our Python code is provided at https://github.com/TaikiMiyagawa/Kaplan‑Meier‑Average‑Run‑Length, offering ready‑to‑use implementations for practitioners.

Authors:Varun Kotte
Title: UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing
Abstract:
LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per‑workload threshold tuning. We present UCCI, a calibration‑first router that maps token‑level margin uncertainty to a per‑query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost‑optimal, and isotonic calibration achieves O(n^‑1/3) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction‑tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro‑F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split‑conformal routing, and a FrugalGPT‑style learned threshold. All cascade results use end‑to‑end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.

Authors:Jing Chen, Shixiang Pan, Yujie Fan, Haocheng Ye, Haitao Xu, Wenqiang Xu
Title: Dimensional Balance Improves Large Scale Spatiotemporal Prediction Performance
Abstract:
Accurate spatiotemporal pattern analysis is critical in fields such as urban traffic, meteorology, and public health monitoring. However, existing methods face performance bottlenecks, typically yielding only incremental gains and often exhibiting limited cross‑domain transferability. We analyze this bottleneck through spatial and temporal entropy measures, which are used as diagnostic indicators of spatiotemporal complexity mismatch rather than as guarantees that entropy alignment alone yields better forecasting. Empirically, larger mismatch is often accompanied by higher prediction uncertainty, especially under a fixed model‑capacity budget. Guided by this diagnostic, we propose a scalable, adaptive framework that harmonizes spatial and temporal feature representations. Spatial dimensionality is compressed via low‑rank matrix embedding to preserve essential structure, while an extended temporal horizon captures long‑range dependencies and mitigates cumulative errors arising from temporal heterogeneity. Extensive experiments on urban traffic, meteorological, and epidemic datasets demonstrate substantial accuracy gains and broad applicability across the evaluated domains, suggesting that the framework is promising for a wide range of spatiotemporal tasks beyond the current study. The code is available on GitHub at https://github.com/ST‑Balance/ST‑Balance.

Authors:Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu
Title: Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Abstract:
Multimodal Large Language Models (MLLMs) still struggle with fine‑grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional‑to‑global perception gap: the same MLLM answers fine‑grained questions more accurately when conditioned on evidence‑centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision‑OPD (Vision On‑Policy Distillation), a regional‑to‑global self‑distillation framework that transfers the model's own privileged regional perception to its full‑image policy. Vision‑OPD instantiates two conditional policies from the same MLLM: a crop‑conditioned teacher and a full‑image‑conditioned student. The student generates on‑policy rollouts, and Vision‑OPD minimizes token‑level divergence between the teacher and student next‑token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground‑truth labels, reward verifiers, or inference‑time tool use. Experiments on multiple fine‑grained visual understanding benchmarks show that Vision‑OPD models achieve competitive or superior performance against much larger open‑source, closed‑source, and "Thinking‑with‑Images" agentic models.

Authors:Chenglei Yu, Chuanrui Wang, Bangyan Liao, Tailin Wu
Title: PACE: Geometry-Aware Bridge Transport for Single-Cell Trajectory Inference
Abstract:
Single‑cell trajectory inference from destructive time‑course snapshots is fundamentally ill‑posed: neither cross‑time cell correspondences nor continuous trajectories are observed, so the snapshot distributions alone do not uniquely determine the underlying dynamics. Existing optimal transport and flow‑based methods typically couple cells by Euclidean proximity at observed clock times, which can misalign trajectories when development is asynchronous and cells sampled at the same experimental time occupy different latent pseudotime stages. We propose PACE, a trajectory inference framework that recovers geometry‑consistent continuous transport dynamics from destructive time‑course snapshots through three coupled components. First, PACE constructs a state‑ and time‑dependent anisotropic Riemannian metric that assigns low transport cost along locally supported tangent directions while penalizing normal velocity components. Second, it alternates between refining cross‑time couplings under the induced path‑action cost and fitting endpoint‑preserving neural bridges between adjacent snapshots. Third, it distills the learned bridge dynamics into a global continuous‑time velocity field over cellular states. Across seven controlled and biological datasets covering nine held‑out reconstruction experiments, PACE achieves the strongest overall reconstruction performance, reducing MMD, Wasserstein‑1 distance, and Wasserstein‑2 distance by 23.7% on average relative to the strongest competing baseline. PACE also improves RNA‑velocity alignment by 15.4% on an embryoid body differentiation benchmark, without requiring explicit cell pairing, lineage tracing, or RNA‑velocity supervision during training. Code is available at https://github.com/AI4Science‑WestlakeU/PACE.

Authors:Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han
Title: Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning
Abstract:
Cooperation is central to multi‑agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter‑agent interactions. Prior robust MARL methods have primarily considered value‑oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction‑breaking adversarial learning (IBAL) framework that takes an information‑theoretic view to construct attacks that impede coordination by perturbing agents' observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent‑missing scenarios. Our code is available at https://sunwoolee0504.github.io/IBAL.

Authors:Ligong Han, Kai Xu, Hao Wang, Akash Srivastava
Title: SNLP: Layer-Parallel Inference via Structured Newton Corrections
Abstract:
Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that is not removed by conventional tensor or pipeline parallelism. We study whether this layerwise dependency can be relaxed by treating the hidden‑state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton‑style updates. While this view is principled, exact Newton corrections require expensive Jacobian‑vector products and naive fixed‑point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap architecture‑induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), where the correction reduces to a prefix‑sum‑like update; in mHC‑style architectures, HC Newton (HCN) uses the model's residual mixing matrix. We also study SNLP‑aware training, including pretraining regularization and direct SNLP‑forward SFT. Experiments on Nanochat‑scale Transformers show that SNLP exposes a practical speed‑quality frontier: on 0.5B models, it reaches up to 2.58x wall‑clock speedup, and a less aggressive configuration reaches 1.40x speedup without increasing PPL. The useful tradeoff comes from the biased finite‑iteration computation induced by IDN/HCN rather than exact recovery of the sequential trace. We further show that SNLP‑forward SFT can preserve downstream task accuracy, and that SNLP can serve as a drafter for self‑speculative decoding while a sequential verifier preserves output correctness.

Authors:Chenxi Wang, Xiaorong Wang, Peiyang Li, Yi Wang
Title: GenTS: A Comprehensive Benchmark Library for Generative Time Series Models
Abstract:
Generative models have demonstrated remarkable potential in time series analysis tasks, like synthesis, forecasting, imputation, etc. However, offering limited coverage for generative models, existing time series libraries are mainly engineered for discriminative models, with standardized workflows for specific tasks, such as optimizing Mean Squared Errors for time series forecasting. This rigid structure is fundamentally incompatible with the distinct and often complex paradigms of generative models (e.g., adversarial training, diffusion processes), which learn the underlying data distribution rather than a direct input‑output mapping. To this end, we proposed GenTS, a comprehensive and extensible benchmark library designed for systematic assessment on generative time series models. GenTS features a unified data preprocessing pipeline, a collection of versatile models, and panoramic evaluation metrics. Its modular design also enables the researchers to flexibly customize beyond our built‑in datasets and models. Based on GenTS, we conducted benchmarking experiments under diverse tasks, accordingly offering suggestions for model selection and identifying potential directions for future research. Our codes are open‑source at https://github.com/WillWang1113/GenTS. The official tutorials and document are available at https://willwang1113.github.io/GenTS/.

Authors:Egor Shvetsov, Aleksandr Serkov, Shokorov Viacheslav, Redko Dmitry, Vladislav Goloshchapov, Evgeny Burnaev
Title: Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes
Abstract:
The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross‑entropy loss, the gradient with respect to positive pre‑activations is non‑negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT‑nano, MP‑SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT‑nano. We characterize the sparsity‑accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU^2 achieves a good sparsity‑‑accuracy ratio in GPT‑nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU^2 outperforms its unclipped version, and GELU^2 achieves the lowest validation loss on GPT‑nano. Code is available at https://github.com/On‑Point‑RND/BugOrFeature.

Authors:Jack Wilkie, Hanan Hindy, Christos Tachtatzis, Miroslav Bures, Robert Atkinson
Title: Few-Shot Network Intrusion Detection Using Online Triplet Mining
Abstract:
Network intrusion detection systems play a vital role in protecting networks by detecting malicious network traffic which can then be investigated by a cybersecurity operations centre. State‑of‑the‑art approaches utilise supervised machine learning methods to train a classification model to recognise known cyberattacks; however, these models require a large labelled dataset to train and show poor performance when trained on smaller datasets. In an attempt to address this shortcoming, anomaly detection models learn the distribution of benign traffic and flag non‑conforming traffic as malicious. While these methods do not require malicious examples to train, they suffer from high false‑positive rates rendering them impractical. As a result, networks may be particularly vulnerable when there are insufficient labelled instances of a specific attack class to train an effective classifier. This often occurs in newly established networks or when previously unseen types of attacks emerge. To address this challenge, this work proposes the use of a triplet network, utilising online triplet mining and a KNN classifier, which is able to perform few‑shot classification, enabling effective intrusion detection after being trained on a limited number of malicious examples. Various online triplet mining algorithms were explored and model design choices, such as the inference algorithm and optimised distance metrics, were compared and evaluated through a series of ablation studies. The final model was compared against other state‑of‑the‑art approaches in few‑shot binary and multiclass classification, where the proposed approach was found to be competitive with existing methods when trained on as little as 10 malicious samples of each class.

Authors:Zhiquan Tan, Yinrong Hong
Title: Self-Supervised On-Policy Distillation for Reasoning Language Models
Abstract:
GRPO‑style RLVR trains reasoning models from multiple on‑policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self‑generated witness of how the current policy can solve the problem, while a wrong completion provides on‑policy prefixes where the policy needs correction. We introduce \emphSelf‑Supervised On‑Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra‑group correct‑‑wrong contrast into dense process supervision without external solution traces. A stopping‑time view motivates the shortest‑correct / longest‑wrong rule as a finite‑group approximation to editing persistent failures toward fast‑success actions, and a prompt‑level frontier weight concentrates the auxiliary loss where correct and wrong branches coexist. Across AIME 2024, AIME 2025, and HMMT 2025, SSOPD improves over GRPO in all nine model‑benchmark settings. On Qwen3‑8B, it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and the solution‑conditioned OPSD baseline by 0.8 points. Code will be released at https://github.com/tzq1999/SSOPD.

Authors:Meng Li, Xiaohua Yang, Jie Liu, Shiyu Yan
Title: A semantic mutation metric for metamorphic relation adequacy in scientific computing programs
Abstract:
Context. Metamorphic Testing addresses the test‑oracle problem in scientific computing, but classical Mutation Score operates on syntactic AST mutations and misses domain semantics. Objective. We propose the Semantic Mutation Score (SMS), built on five domain‑semantic operators (Conservation Erosion, Operator Substitution, Hyperparameter, Trajectory Flip, Structural Injection). SMS degenerates almost everywhere to MS in a characterised limit, so any SMS‑based conclusion remains consistent with prior mutation‑testing literature in the classical regime. Method. A 12‑PUT x 5‑MP design over four single‑output float‑to‑float classes (numeric, probabilistic, surrogate, machine‑learning) is paired with a three‑layer attribution classifier separating true semantic faults from tolerance, OOD, statistical, and artefact categories. A same‑source / cross‑source ablation under an identical prompt isolates the LLM‑source‑diversity contribution. LLM‑generated mutants are compared against a default‑configuration cosmic‑ray syntactic pool at the AST‑normalised level. Results. The pre‑registered large‑effect threshold for Cliff's delta is not met under the point‑estimate criterion; the observed effect lies in the medium‑effect range. Cross‑source pooling under an identical prompt does not appreciably shift delta, indicating that LLM identity is not the lever within this design. AST‑level overlap between LLM‑generated and default cosmic‑ray syntactic mutants is small; the Hyperparameter, Structural Injection, and Trajectory Flip classes are unreachable under default first‑order syntactic configurations. Conclusion. SMS is a backward‑compatible adequacy metric for domain‑semantic metamorphic‑relation sets in scientific computing. The first‑order unreachability evidence is independent of the effect‑size question.

Authors:Tarun Sharma
Title: IVF-TQ: Streaming-Robust Approximate Nearest Neighbor Search via a Codebook-Free Residual Layer
Abstract:
We propose IVF‑TQ, an IVF index with a codebook‑free residual layer: a fixed random rotation followed by precomputed Lloyd‑Max scalar quantization depending only on (b, d). Only the IVF coarse partition is trained. Building on TurboQuant (Zandieh et al., 2025), the design substantially reduces a key failure mode of trained‑codebook ANN indexes (PQ, OPQ, ScaNN): staleness under streaming ingestion.Empirical (3 seeds): Per‑batch PQ retraining does not recover the streaming gap at any tested bit budget (paired‑t p > 0.28 everywhere). On streaming Deep‑10M, IVF‑TQ holds at 87.4% ‑> 86.6% (Delta = ‑0.80 +/‑ 0.10pp) while IVF‑PQ degrades ‑3.23pp. A shuffled‑i.i.d. control on SIFT‑1M shows IVF‑PQ losing ‑3.9pp without distribution shift. At higher PQ bit budgets (~1.5x IVF‑TQ memory), absolute recall favors PQ as expected from rate‑distortion (+6.1pp Deep‑10M; +2.0pp SIFT‑10M); the durable IVF‑TQ benefit is operational (no codebook to retrain), robust across memory regimes.Prior art: IVF around a codebook‑free residual quantizer is architecturally not new ‑‑ IVF‑RaBitQ ships in Milvus, cuVS, LanceDB, Weaviate; Shi et al. (2026) is concurrent GPU work. TurboQuant itself tests only flat‑rotation ANN.Contributions: (i) A multi‑seed streaming‑operational story for codebook‑free IVF: 10M‑scale evidence across PQ memory budgets. (ii) A uniform‑over‑sphere IP‑error bound for the TQ residual quantizer with one fixed rotation (proof sketch in v1; rigorous in v2). (iii) Adaptive IVF‑TQ: a partition‑only refresh recovering 67% ‑> 97.8% under worst‑case rotation shift with re‑ranking (90.3% without).Code, data: https://github.com/tarun‑ks/turboquant_search

Authors:Meng Li, Xiaohua Yang, Jie Liu, Shiyu Yan
Title: NOETHER: A Constructive Framework for Metamorphic Pattern Discovery from Operator Algebras
Abstract:
Context. Metamorphic Testing is recognised in IEEE/ISO software‑testing standards and increasingly recommended for AI systems, but its progress is bottlenecked by metamorphic relation (MR) identification: existing approaches (structured frameworks, mining and evolutionary pipelines, LLM‑assisted methods, MetaPattern catalogues) share an inductive grounding that leaves three foundational questions open: origin, closure, and transferability. Objective. We propose a framework whose downstream step from program‑induced operator algebra to MetaPattern set is mechanical and provable, while the upstream curation of the algebra is a stated empirical hypothesis with explicit scope precondition. Method. NOETHER is a two‑layer framework. The upstream layer is an eight‑block decomposition over recurrent mathematical structures (symmetry, order, self‑adjoint, time‑reversal, limit, qualitative‑dynamics, method‑comparison, relational equivalence). The downstream CONSTRUCT‑MP algorithm produces a MetaPattern set with algebraic‑closure (Theorem 1) and polynomial‑time decidability (Theorem 2) guarantees. We test the framework on three operator‑algebraic domains. Results. On Boltzmann reactor physics NOETHER systematises a prior inductive catalogue; on equivariant ML it derives executable MRs for rotation invariance, adjoint duality, and training‑trajectory reversibility; on relational query optimisers it exercises the relational‑equivalence block. The central falsifiable prediction (L‑blindness on homogeneity‑preserving mutators) holds on the in‑scope substrate. The absolute‑completeness conjecture (Theorem 1') is falsified on PWR core diffusion via two pairwise‑independent counterexamples that identify five Translate‑extension dimensions. Conclusion. Induction is relocated from per‑program MR sampling to a per‑domain algebraic layer; the downstream step is deductive and mechanical.

Authors:Qiran Zou, Hou Hei Lam, Wenhao Zhao, Tingting Chen, Yiming Tang, Samson Yu, Yingtao Zhu, Srinivas Anumasa, Zufeng Zhang, Tianyi Zhang, Chang Liu, Zhengyao Jiang, Anirudh Goyal, Dianbo Liu
Title: FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
Abstract:
AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill‑climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process‑level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML‑Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process‑level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill‑climber nearly matches the best‑performing tree‑search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree‑search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process‑level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML‑bench.

Authors:Jingru Fei, Kun Yi, Alex Xing Wang, Qingsong Wen, Xiangxiang Zhu, Wei Fan
Title: Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density
Abstract:
Time series foundation models rely on large‑scale pretraining over diverse datasets across domains, yet their heterogeneity in temporal patterns could hinder the effectiveness of training and learning transferable time series representations. Inspired a fundamental concept, normalized power spectral density (PSD) in signal processing, we assume harmonizing datasets via PSDs in the spectral domain could reduce mismatches and enhance pretraining. We then go beyond the direct intractable minimization optimization and innovatively reformulate it as a principled harmonization approach. Specifically, we propose Harmonizer, a module that reshapes spectral structures and implicitly harmonizing PSDs across datasets, which theoretically corresponds to a shared reparameterization of second‑order temporal correlations. Our theoretical analysis further reveals token interactions with Harmonizer can be efficiently mediated by a compact set of resonators, motivating a HarmonicAttention design that performs self‑attention in a low‑dimensional interaction space. Then, we propose Olivia, a novel time series foundation model built upon these harmonization mechanisms. Extensive experiments on two large‑scale benchmarks (TSLib and GIFT‑Eval) and extra 6 datasets from GluonTS, demonstrate Olivia consistently achieves state‑of‑the‑art performance under zero‑shot, few‑shot, and full‑shot forecasting scenarios. Our code is available at https://github.com/TSTS13/Olivia.

Authors:Weichu Xie, Haozhe Zhao, Wenpu Liu, Yongfu Zhu, Liang Chen, Minghao Ye, Zirong Chen, Yuqi Xu, Shuai Dong, Ziyue Wang, Xinbo Xu, Kean Shi, Ruoyu Wu, Xiaoying Zhang, Wenqi Shao, Baobao Chang, Nan Duan, Jiaqi Wang
Title: Step-wise Rubric Rewards for LLM Reasoning
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final‑answer correctness with no supervision over intermediate steps. Rubric‑based methods such as Rubrics as Rewards (RaR) introduce finer‑grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi‑criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self‑correction. On 1,000 problems, we find 18.2% of steps in correct‑answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect‑answer responses are correct yet penalized. We introduce Step‑wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per‑step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per‑step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K‑problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3‑8B and 2.75 points on Qwen3‑32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self‑correction looping from 48.1% to 26.5%.

Authors:Jon Saad-Falcon, Avanika Narayan, Robby Manihani, Tanvir Bhathal, Herumb Shandilya, Hakki Orhun Akengin, Gabriel Bo, Andrew Park, Matthew Hart, Caia Costello, Chuan Li, Christopher Ré, Azalia Mirhoseini
Title: OpenJarvis: Personal AI, On Personal Devices
Abstract:
Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud‑hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5‑9B drops accuracy by 25‑39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state‑of‑the‑art prompt optimizers close just 5 pp of the local‑cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local‑cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. Each primitive is an independently editable field, making the stack end‑to‑end optimizable and measurable against accuracy, cost, and latency. Towards closing the local‑cloud gap without surrendering local‑model properties, OpenJarvis introduces LLM‑guided spec search, a local‑cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non‑regressing edits are accepted, and the resulting spec runs entirely on‑device at inference time. With LLM‑guided spec search, on‑device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end‑to‑end latency by 4x.

Authors:David Troxell, Yulia Alexandr, Sofia Hunt, Stephanie Lei, Guido Montúfar
Title: Stress-Testing Neural Network Verifiers with Provably Robust Instances
Abstract:
Neural network verifiers aim to provide formal guarantees on model behavior, but existing verification benchmarks are fundamentally limited by their lack of ground‑truth labels. As a result, verifier evaluation relies on indirect heuristics, which prevents exact scoring and systematic study of verifier failure modes. We address this gap by introducing a reusable framework for generating verification instances whose ground‑truth robustness labels are known a priori through analytic construction. Our framework led to the discovery of multiple numeric tolerance concerns and an implementation bug in popular verifiers, highlighting the need for ground‑truth labels. Additionally, to systematically study verifier failure modes, we introduce the verification Difficulty Profile, a collection of estimable quantities capturing distinct sources of instance hardness. Using our framework and these profiles, we evaluate five state‑of‑the‑art verifiers and show that different instances stress distinct aspects of the verification pipeline. We show that these results can aid the future development of verifiers as they provide actionable targets for improving numerical reliability, relaxation quality, and search behavior. Our code is publicly available: https://github.com/dtroxell19/VeriStressGT.git.

Authors:Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran
Title: A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation
Abstract:
Point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherent unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine learning based methodologies. To combat these issues, diverse strategies have been developed, including converting to a format that has orderliness, extracting local geometry, and permutation‑invariant or self‑attention‑based processing. In this paper, our focus is directed towards deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in‑depth discussion on its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.

Authors:David Troxell, Noah Roemer, Guido Montúfar
Title: Differentiable Optimization Layers for Guaranteed Fairness in Deep Learning
Abstract:
Differentiable optimization layers are traditionally integrated in predict‑then‑optimize frameworks where a neural model estimates parameters that subsequently serve as fixed inputs to downstream decision‑making optimization problems. In this work, we introduce the concept of a "fairness layer": a differentiable optimization layer appended to a model's output layer that guarantees a chosen notion of output parity is satisfied when integrated into a neural network. Additionally, we introduce an online primal‑dual inference algorithm that provides provable aggregate fairness guarantees for streaming predictions with arbitrarily small batch sizes, where traditional per‑batch constraints become overly restrictive. Numerical experiments demonstrate the effectiveness of the fairness layer and associated algorithm, and theoretical analysis characterizes the layer's differentiability and stability properties during model training and backpropagation. Our code for these experiments is publicly available on GitHub (https://github.com/dtroxell19/FairDL‑ICML‑2026.git) and our public Python package documentation can be found online: https://dtroxell19.github.io/fairness_training/.

Authors:Tristan Gaudreault, Yongyi Mao
Title: Parallel Recursive LSTM
Abstract:
Transformers have become the dominant architecture for sequence modeling by using self‑attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long‑context settings. Recurrent models such as LSTMs provide explicit nonlinear state updates and strong state‑tracking capabilities, yet their strictly sequential computation limits parallelism. We introduce the Parallel Recursive LSTM (PR‑LSTM), a hierarchical recurrent architecture that replaces left‑to‑right recurrence with recursive nonlinear state composition over a balanced computation tree. Tokens are first mapped independently to latent states, which are then recursively merged by a learned gated composition block. This structure uses the reduction pattern underlying parallel scans as a fixed execution schedule, rather than assuming an associative recurrence. As a result, PR‑LSTM retains nonlinear gated state representations while reducing recurrent parallel depth from linear to logarithmic. Empirically, PR‑LSTM achieves strong sequence‑length generalization on formal‑language benchmarks, solving more tasks than standard RNN, LSTM, and Transformer baselines, while avoiding the quadratic scaling of attention. These results suggest that recurrent computation can be reorganized hierarchically to expose parallelism without restricting the transition dynamics to linear or associative forms.

Authors:Sajjad Khan
Title: S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination
Abstract:
Concurrent LLM agents sharing mutable natural‑language state produce Structural Race Conditions (SRCs): write‑write and cross‑shard stale‑read conflicts that silently corrupt agent output. Existing multi‑agent frameworks (LangGraph, CrewAI, AutoGen) provide no write‑ownership semantics over shared state. We present S‑Bus, an HTTP middleware whose central mechanism is a server‑side DeliveryLog: a per‑agent log of HTTP GET operations that automatically reconstructs each agent's read set at commit time without agent SDK changes under HTTP/1.1. The consistency property the DeliveryLog provides ‑‑ Observable‑Read Isolation (ORI), a partial causal consistency over the HTTP‑observable projection of the read set ‑‑ prevents structural race conditions when agents collaborate via shared shards. Three contributions: (C1) The DeliveryLog mechanism for automatic HTTP‑traffic‑based read‑set reconstruction, with three‑tier mechanised evidence: ReadSetSoundness and ORICommitSafety machine‑checked in TLAPS (modulo one retained typing axiom); exhaustive TLC at N=3 (20,763,484 distinct states, zero violations); Dafny discharges 9 inductive soundness lemmas. (C2) Empirical structural‑conflict prevention parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI on shared‑shard contention sweeps with 427,308 active HTTP‑409 conflicts: zero Type‑I corruptions across all three backends. (C3) ORI's operating envelope is topology‑conditional: semantically neutral in dedicated‑shard workloads; harmful in single‑shard collaborative writing because preservation propagates concurrent contradictions. Source code: https://github.com/sajjadanwar0/sbus

Authors:Aleksandr Churilov
Title: The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort
Abstract:
Spracklen et al. (USENIX Security '25) showed that code‑generating large language models hallucinate package names that do not exist on PyPI or npm at rates ranging from 5.2% on commercial models to 21.7% on open‑source models, creating an attack surface for slopsquatting ‑‑ the registration of malicious packages under hallucinated names. We replicate their methodology on five frontier code‑capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT‑5.4‑mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists, we measure overall hallucination rates between 4.62% (Claude Haiku 4.5) and 6.10% (GPT‑5.4‑mini) ‑‑ an order‑of‑magnitude compression of the inter‑model spread observed by Spracklen, but not a retirement of the threat. Beyond replication, we identify a set of 127 package names (109 on PyPI, 18 on npm) that all five evaluated models invent identically, constituting a model‑agnostic supply‑chain attack surface that no single‑model study can reveal. We further document a Python‑over‑JavaScript hallucination asymmetry that inverts Spracklen's 2024 finding, identify a Haiku‑below‑Sonnet inversion within the Anthropic family, and observe a Jaccard‑similarity peak between DeepSeek V3.2 and GPT‑5.4‑mini (J = 0.343) suggestive of shared training‑data origins.

Authors:Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald
Title: 1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?
Abstract:
Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce 1GC‑7RC (Single Graphic Card: Seven Research Challenges), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time‑series forecasting, and text classification. Each task provides a locked data‑preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task‑specific wall‑clock budget (40‑120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open‑source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent‑task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time‑budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC‑7RC‑Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi‑agent settings, making it a flexible platform for future research on autonomous research agents.

Authors:Peng Cui, Boyao Yang, Jun Zhu
Title: Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Abstract:
Reinforcement Learning (RL) post‑training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fundamental inefficiency, we propose Learning‑Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model's active learning frontier. At its core, we define a closed‑form Learning‑Zone Energy Score that fuses three complementary signals, an initial‑difficulty anchor, a normalized outcome‑uncertainty term, and a pass‑rate momentum, into a single scalar that is provably aligned with the expected magnitude of group‑relative policy gradient updates. A forward pruner with replay further reduces wall‑clock time cost by skipping rollout generation for persistently solved prompts while periodically checking for forgetting. Evaluated on Qwen‑family models (1.5B‑8B) across GSM8K, MATH and DAPO‑MATH, our method retains only 40% of the training data per step yet matches or surpasses full‑data baselines, with especially pronounced out‑of‑distribution gains on AIME25 (+45.9%) and AMC23 (+18.2%), alongside an estimated 36% reduction in training FLOPs. Our code is available at https://github.com/Stellaris167/LZE.

Authors:Ruth Wan Theng Chew, Zhiliang Chen, Apivich Hemachandra, Bryan Kian Hsiang Low
Title: BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks
Abstract:
Optimization of LLM training and inference configurations, such as hyperparameters, data mixtures, and prompts, is critical to performance, but it is often approached heuristically in practice, leading to potentially suboptimal outcomes. By framing them as noisy, expensive, and derivative‑free optimization problems, Bayesian optimization (BO) and other black‑box optimization (BBO) methods offer a promising yet underexplored direction for principled, sample‑efficient methods. However, LLM training and inference costs are prohibitively high for most of the BBO research community, and new methods are often only evaluated on synthetic test functions and small‑scale datasets that fail to capture the challenges of modern LLM optimization problems. This impedes the development of BBO methods and makes it difficult to assess their effectiveness on modern LLM tasks. We introduce BoLT, the first LLM‑centric benchmark that democratizes LLM research for the BBO community. BoLT is released at https://github.com/chewwt/bolt. BoLT covers broad and well‑motivated LLM optimization problems, involving multi‑fidelity, multi‑objective, heteroscedastic noise, and high‑dimensional search spaces. Each problem in BoLT is grounded in real experimental data and made fully reproducible and accessible through lightweight surrogate models fitted to the results of thousands of real LLM experiments. We benchmark BoLT against an extensive range of BO and BBO methods, showing that selected BO methods consistently outperform others across tasks and highlighting gaps in existing BBO methods on LLM tasks, underscoring the need to modernize benchmarks for the BBO community.

Authors:Anthonio Oladimeji Gabriel, Ahmad Rufai Yusuf
Title: Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings
Abstract:
Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English‑language inputs, conditions that do not reflect the realities of healthcare delivery in low‑resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross‑lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine‑tuned on the COVID‑QU‑Ex chest X‑ray dataset (85,318 images; COVID‑19, Non‑COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N‑ATLAS) on 20 COVID‑19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba‑inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African‑context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.

Authors:Shilong Jin, Lanjun Wang, Zhuosheng Zhang
Title: SE-GA: Memory-Augmented Self-Evolution for GUI Agents
Abstract:
Autonomous Graphical User Interface (GUI) agents often struggle with multi‑step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work proposes the Self‑Evolving GUI Agent (SE‑GA), a novel framework that integrates hierarchical memory structures with an iterative self‑improvement mechanism. At the core of our approach is Test‑Time Memory Extension (TTME), which facilitates long‑term planning by dynamically retrieving episodic, semantic, and experiential memories to provide salient contexts during inference. To ensure continuous learning, we introduce Memory‑Augmented Self‑Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent's foundational policy. Extensive evaluations across both offline and online benchmarks demonstrate SE‑GA achieves state‑of‑the‑art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl‑High dataset. Furthermore, significant improvements on the AndroidWorld benchmark highlight the superior generalization to dynamic environments. Open source code: https://github.com/jinshilong‑dev/SE‑GA

Authors:Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey
Title: Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning
Abstract:
Compositional diffusion models offer a promising route to long‑horizon planning by denoising multiple overlapping sub‑trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute‑heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long‑horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search‑guided Diffuser (XDiffuser) first computes a plan over a state‑space graph ‑‑ serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion‑based baselines on long‑horizon tasks, with particularly large gains in the low‑quality data regime and on unseen tasks beyond goal‑reaching, including multi‑agent coordination and TSP‑style reasoning. Project website: https://yanivhass.github.io/XDiffuser‑site/

Authors:Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen
Title: Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
Abstract:
Knowledge distillation is central to LLM post‑training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off‑policy distillation and on‑policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token‑level KL direction. This follows from decomposing sequence‑level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token‑level forward KL, and reverse KL pairs student prefixes with token‑level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient‑level identities showing forward KL gives SFT‑style cross‑entropy matching with teacher soft targets, whereas reverse KL gives an RL‑style policy‑gradient objective with a dense teacher‑student log‑ratio reward, connecting them to off‑policy SFT, DAgger‑style on‑policy SFT, offline‑RL‑style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy‑entropy tradeoff, prefix source a quality‑compute tradeoff, and training length an accuracy‑stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy‑gated length curriculum. KL mixing shows long‑sequence distillation requires substantial forward‑KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy‑gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long‑horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.

Authors:Shuo Liu, Ding Liu, Shi-Ju Ran
Title: Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning
Abstract:
Large language models (LLMs) generate not only reasoning text, but also token‑level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content‑agnostic confidence geometry associated with trace‑level final‑answer correctness. Using only token‑level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low‑dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies‑‑Bouldin index, consistently corresponds to higher correctness‑discrimination AUC. We further show that correctness‑related information is enriched in the tail of reasoning, suggesting that late‑stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf‑derived scores improve confidence‑weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace‑intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.

Authors:Youngmok Ha, Viktor Schlegel, Yidan Sun, Anil Anthony Bharath
Title: Jacobian-Guided Anisotropic Noise Reshaping for Enhancing Representation Utility under Local Differential Privacy
Abstract:
While Local Differential Privacy (LDP) serves as a foundational primitive for distributed data collection, its stringent noise injection requirement often leads to severe degradation in data utility. This degradation stems from the task‑agnostic nature of conventional LDP mechanisms, which inject noise uniformly across all dimensions regardless of their relative importance to the downstream objective. To address this issue, we propose a novel approach that mitigates noise in task‑relevant subspaces of the data representation. Our method identifies task‑critical subspaces via the Jacobian matrix of the public downstream model, selectively attenuates noise along those dimensions, and reshapes the isotropic noise of standard LDP into an anisotropic distribution. This method preserves the uniform per‑dimension privacy budget while heterogeneously modulating noise impact across dimensions, thereby substantially enhancing data utility. Furthermore, our approach generalizes to both linear and non‑linear models and integrates seamlessly with existing mechanisms. Extensive experiments on CIFAR‑10‑C (Brightness corruption at the highest severity level 5) demonstrate that integrating our approach improves the utility of PrivUnit2 and PrivUnitG by approximately 20% at ε=7.5. The source code is available at https://github.com/ymha/jacobian‑anr‑ldp.

Authors:Yangyou Liu, Zezhi Shao, Xinyu Chen, Hu Chen, Fei Wang, Yuankai Wu
Title: PULSE: Generative Phase Evolution for Non-Stationary Time Series Forecasting
Abstract:
Time series forecasting under non‑stationarity faces a fundamental tension between capturing stable representations and adapting to distribution shifts. Existing methods implicitly rely on static historical assumptions, leading to a critical failure mode we term Phase Amnesia, where models become blind to the evolving global context. To resolve this, we formalize non‑stationary dynamics through three physical hypotheses: wold decomposition, dynamical phase evolution, and heteroscedastic manifold generation. These principles inspire PULSE, a physics‑informed, plug‑and‑play framework adopting a Disentangle‑‑Evolve‑‑Simulate design philosophy. Specifically, PULSE utilizes phase‑anchored disentanglement to resolve optimization interference caused by dominant trends, employs a Phase Router to actively generate future trajectories, and introduces Statistic‑Aware Mixup (SAM) to ensure robustness against out‑of‑distribution volatility. Empirically, PULSE enables a simple MLP backbone to achieve state‑of‑the‑art or highly competitive performance across 12 real‑world benchmarks. This validates that a correct physics‑informed inductive bias is far more critical than raw architectural complexity for non‑stationary forecasting. The code is available at: https://github.com/Gemost/PULSE.

Authors:Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
Title: TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition
Abstract:
Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi‑step composition settings. Outcome‑based rewards provide only sparse feedback, while trajectory‑supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory‑Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence‑level feedback derived from fine‑grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory‑supervised rewards collapse beyond step‑4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi‑level supervision for compositional reasoning.

Authors:Yulin Chen, He He, Chen Zhao
Title: The Unlearnability Phenomenon in RLVR for Language Models
Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving Large Language Model's (LLM) reasoning ability. However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross‑example gradient analysis, we show that unlearnable examples have fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at \urlhttps://github.com/yulinchen99/unlearnability‑rlvr.

Authors:Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen
Title: Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning
Abstract:
Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD‑based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR‑based unlearning risks the re‑emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion (\mathrmD^2), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement \mathrmD^2, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy‑based Unlearning Alignment (EUA) to enforce energy‑boundary unlearning during training and apply an energy‑based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of \mathrmD^2. Our code is available at https://github.com/Puning97/EUA‑for‑LLM‑Unlearning.

Authors:Siqi Zeng, Christopher Jung, Rui Li, Zhe Kang, Ming Li, Nima Noorshams, Zhigang Wang, Fuchun Peng, Han Zhao, Xue Feng
Title: Convex Dataset Valuation for Post-Training
Abstract:
Improving LLM performance on downstream tasks sometimes requires leveraging auxiliary datasets during post‑training. In practice, however, developers face constraints on compute, labeling, and licensing costs that preclude using all available data, necessitating principled dataset‑level selection. These constraints are increasingly shaped by dataset marketplaces, where data acquisition is governed by budgets and negotiation. We study dataset valuation as a subset selection problem during LLM post‑training. Our goal is to identify and weight auxiliary datasets so as to maximize target task performance given constrained budgets. We first show that commonly used gradient alignment scores provide a reasonable yet incomplete valuation signal, as they ignore redundancy among datasets. To address this, we propose a scalable convex dataset‑level valuation method based on kernel mean matching (KMM) in gradient space, which jointly accounts for alignment with the target task and redundancy across auxiliary datasets. Through extensive experiments across diverse post‑training settings and tasks, we show that our approach consistently outperforms existing valuation baselines, achieving stronger performance with low computational overhead. Our results position dataset valuation as a practical decision tool for post‑training data selection in market‑constrained large language model settings. The code is available at https://github.com/uiuctml/convex_data_valuation.

Authors:Kyrie Zhao, Zehong Wang, Tianyi Ma, Fang Wu, Xiangru Tang, Pietro Lio, Sheng Wang, Yanfang Ye
Title: Hypergraph Pattern Machine: Compositional Tokenization for Higher-Order Interactions
Abstract:
Hypergraphs model higher‑order relations that drive real‑world decisions, from drug prescriptions to recommendations. A central structural signal in such data, beyond what pairwise relations can express, is interaction compositionality: whether a higher‑order relation is compositional, emergent, or inhibitory with respect to its observed or unobserved sets. In polypharmacy, the regime decides whether a drug should be dropped, kept, or excluded: a compositional drug triple can be safely simplified, an emergent triple requires all drugs jointly, and an inhibitory triple flags a drug that disrupts an existing interaction. However, existing hypergraph learning methods, which merely propagate messages over observed hyperedges, leave this compositional signal unmodeled, allowing dangerous drug combinations to slip through and be misclassified. To this end, we propose the Hypergraph Pattern Machine (HGPM), shifting the paradigm from message passing to learning the compositional pattern of subsets. It tokenizes compositional subsets, organizes them in an inclusion DAG, and trains an inclusion‑aware Transformer under masked reconstruction. On ten hypergraph benchmarks, HGPM matches or exceeds state‑of‑the‑art methods. Notably, in a real adverse‑event prediction case, HGPM correctly identifies the drug addition that inhibits the side effect among feature‑identical candidates, a discrimination existing methods cannot make. The code and data are in https://github.com/KryieZhao/HGPM.git.

Authors:Amin Karimi Monsefi, Abolfazl Meyarian, Mridul Khurana, Shuheng Wang, Pouyan Navard, Cheng Zhang, Anuj Karpatne, Wei-Lun Chao, Rajiv Ramnath
Title: SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability
Abstract:
Animals are described as effectively camouflaged when they blend seamlessly with their surrounding, yet no standardized quantitative measure of this seamlessness exists. We address this gap by framing camouflage evaluation as a visual localization problem: a well‑camouflaged animal is one that remains difficult to detect even when its category is known. We introduce SeamCam (Seamless Camouflage), a metric that quantifies how detectable an animal is from the available visual evidence. Given an image and a target species, SeamCam generates category‑conditioned detection proposals, extracts segmentation masks, and identifies the subset whose collective union yields the highest IoU with the ground‑truth mask. The SeamCam score is one minus this maximum recoverable localization signal, where a higher score indicates stronger camouflage (i.e., lower detectability). In a human two‑alternative forced‑choice study with 94 participants and 2,390 comparisons, SeamCam achieves 78.82% agreement with human camouflage difficulty judgments, outperforming state‑of‑the‑art by about 25%. We then demonstrate SeamCam's utility as a preference signal for Direct Preference Optimization (DPO) to fine‑tune a diffusion‑based inpainting model for camouflage generation. This offers an affordable training approach with an objective explicitly suited for camouflage generation, unlike typical diffusion models. To support rigorous benchmarking, we further introduce CamFG‑1.5k, a curated dataset of 1,521 high‑resolution images in which animals are fully visible prior to camouflage generation, enabling unbiased evaluation by controlling for occlusion artifacts present in existing datasets. https://7amin.github.io/SeamCam/

Authors:Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song
Title: Identifiable Token Correspondence for World Models
Abstract:
Token‑based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long‑horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next‑frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token‑based transformer world models that formulates next‑frame prediction as a structured assignment problem with latent token correspondence variables: each next‑frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state‑of‑the‑art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax‑classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu‑mllab/Identifiable‑Token‑Correspondence.

Authors:Jongho Yoon, Jinsung Jeon, Seokhyeong Kang
Title: Physics-Guided Geometric Diffusion for Macro Placement Generation
Abstract:
Macro placement is a pivotal stage in VLSI physical design, fundamentally determining the overall chip performance. Recent data‑driven placement methods have demonstrated significant potential, yet they often struggle to handle sequential dependencies and to balance topological connectivity with physical constraints. To bridge this gap, we propose MacroDiff+, a physics‑guided geometric diffusion framework. Specifically, we design a dual‑domain denoising architecture that couples topological connectivity encoded by heterogeneous GNNs with global geometric context modeled by a Transformer. Furthermore, we introduce Physics‑Guided Sampling, an inference strategy that actively steers the generation using explicit gradients to ensure both statistical plausibility and physical validity. On the ISPD2005 MMS benchmarks, MacroDiff+ outperforms state‑of‑the‑art baselines with a 6.1‑6.2% reduction in wirelength. Notably, it exhibits superior stability and scalability on large‑scale designs where prior methods fail to converge. The source code is available at https://github.com/jhy00n/MacroDiff‑plus.

Authors:Jiajian Li, Jingyuan Huang, Junru Gong, Qi Wang, Xiaokang Yang, Yunbo Wang
Title: OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence
Abstract:
We present OrbiSim, a novel robotic simulation paradigm that redefines world models as a fully differentiable physics engine for embodied intelligence. Unlike prior world models that focus on unconstrained imagination in latent or visual domains, OrbiSim establishes a unified, physically‑grounded pathway that bridges structured scene assets, neural dynamics, and downstream reinforcement learning. By enabling end‑to‑end differentiability throughout the entire simulation loop ‑‑ spanning from explicit state transitions to visual observation generation ‑‑ OrbiSim supports tasks traditionally intractable for classical simulators, such as differentiable contact modeling, gradient‑based policy optimization under sparse rewards, and intuitive physical inference. Empirical results demonstrate that OrbiSim significantly outperforms state‑of‑the‑art world models in both predictive fidelity and control performance. Furthermore, its consistent responsiveness to asset configurations and physical parameters suggests its potential as a differentiable tool for enhancing robot simulation and policy training.

Authors:Shunchang Liu, Xin Chen, Belen Martin Urcelay, Francesco Croce
Title: Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
Abstract:
Preference learning in large language models relies on reward models as proxies for human judgment. However, these models frequently exhibit preference instability, producing contradictory preference assignments in response to subtle, meaning‑preserving input variations. We analyze this instability at the representation level under three semantic‑preserving perturbation types: paraphrasing, pattern injection, and backdoor triggers. We attribute this instability to over‑reliance on predictive yet brittle features, which we term unstable features, and isolate them via Sparse Autoencoders (SAEs) in a sparse latent space where benign and perturbed inputs activate distinctly separable patterns. Building on this separability, we propose two SAE‑based instability mitigation strategies: SAE Feature Steering, which identifies and suppresses anomalously activated features at inference, and SAE Residual Correction, which learns adaptive adjustments over SAE features to restore correct preferences. Our methods substantially reduce incorrect preference assignments on harmlessness and hallucination benchmarks while preserving benign performance and general utility on other tasks, without retraining the reward model. Our code and data are available in \urlhttps://github.com/shunchang‑liu/pisa.

Authors:Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song
Title: ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning
Abstract:
LLM‑based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge‑‑operator schemas, preconditions, and constraints‑‑remains unrepaired. Existing self‑evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro‑symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure‑Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi‑dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi‑seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs‑‑strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72‑100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring‑failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight‑level and prompt‑level adaptation for persistent fault elimination.

Authors:Allen Lu, Isabella Luong, Joyee Chen
Title: MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment
Abstract:
Single‑turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow‑up conversational turns introduce economic, social, or authority‑based arguments. We introduce MANTA (Multi‑turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi‑turn evaluation framework built on the Inspect AI platform that stress‑tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow‑up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB‑derived scoring dimensions on a continuous 0‑1 scale. We present preliminary results from evaluations of claude‑sonnet‑4‑20250514 and openai/gpt‑4o, revealing consistent patterns: Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance; evidence‑based capacity attribution is the weakest dimension across all models and runs; and AI governance scenarios elicit significantly stronger welfare reasoning (mean score 0.91) than first‑order practical scenarios. We additionally present STYLEJUDGE, a controlled four‑judge study demonstrating systematic format bias in LLM‑as‑judge evaluation, with directly actionable implications for MANTA's scorer design. Code, dataset, and evaluation logs are available at https://github.com/Mycelium‑tools/manta.

Authors:Ruichen Zheng, Biao Zhang, Michael Birsak, Mikhail Skopenkov, Peter Wonka
Title: Patchwork: A compact representation for 3D polygonal shapes
Abstract:
We introduce Patchwork, a new general‑purpose shape representation capable of modeling 2D and 3D geometry with a small number of parameters. Patchwork is grounded in a rigorous mathematical framework, providing provable complexity bounds and the ability to approximate arbitrary shapes with arbitrary precision in any dimension. We propose an efficient gradient‑based optimization scheme to fit Patchwork representations to 2D and 3D data, along with a novel regularization loss that progressively prunes redundant elements, yielding high compactness after convergence. Our approach offers fast fitting performance, a fraction of the required parameters compared to existing alternatives, and native support for inside‑outside classification, making it a versatile and compact representation for geometric learning and reconstruction tasks, with future potential for 3D generation. Our implementation is available at: https://github.com/Ankbzpx/patchwork‑experiment.

Authors:Shuchan Wang
Title: Dynamics-Level Watermarking of Flow Matching Models with Random Codes
Abstract:
We introduce a dynamics‑level approach to watermarking generative models. Rather than embedding signals into model weights or outputs, we embed the watermark directly into the learned continuous dynamics ‑‑ the velocity field of a flow matching model. We formulate this as random coding over a continuous channel: a key‑dependent perturbation is added during training, and the message is recovered at detection time from black‑box queries. The perturbation is designed to leave the generated distribution unchanged. Experiments on MNIST and CIFAR‑10 across different architectures confirm reliable message recovery, preserved generation quality, and chance‑level decoding accuracy without the secret key.

Authors:Gwenolé Quellec
Title: Constrained latent state modeling: A unifying perspective on representation learning under competing constraints
Abstract:
Learning latent representations from complex data is central to modern machine learning, spanning temporal, multimodal, and partially observed systems. In such settings, representations are better understood as latent states capturing underlying system dynamics, rather than as mere compressed summaries of observations. Yet current approaches remain fragmented, relying on distinct ‑‑ and often implicit ‑‑ assumptions about what these states should represent. We argue that this fragmentation reflects a more fundamental limitation: latent representations are typically learned from underconstrained objectives that fail to specify the properties that meaningful latent states should satisfy. As a result, multiple representations can satisfy the same objective, leading to ambiguity in their structure and interpretation. While many of the underlying principles have been explored in isolation, their interactions have not been explicitly formalized. In this work, we propose constrained latent state modeling (CLSM) as a unifying perspective. We identify a set of core properties ‑‑ predictive sufficiency, minimality, temporal coherence, observation compatibility, invariance to nuisance factors, and structural constraints ‑‑ and show that they are intrinsically coupled through fundamental trade‑offs. Revisiting major modeling families through this lens, we show that existing approaches can be interpreted as enforcing different subsets of constraints, thereby occupying distinct regions of a common design space. This perspective reframes persistent challenges such as lack of identifiability as consequences of underconstrained formulations, rather than isolated technical limitations. More broadly, CLSM provides a principled framework to make design choices explicit, to analyze trade‑offs, and to guide the development of more interpretable, robust, and task‑aligned latent state models.

Authors:Himanshu Singh Baghel
Title: ADAPT: A Self-Calibrating Proactive Autoscaler for Container Orchestration
Abstract:
Proactive autoscaling for containerized workloads depends on knowing the provisioning delay, i.e., the time between a scaling decision and the moment new capacity is ready to serve traffic. In practice, this cold‑start duration can vary substantially across environments and even across consecutive scale‑out events. We present ADAPT (Adaptive Duration Approximation for Predictive Timing), an online EWMA estimator that tracks coldstart duration at runtime. ADAPT feeds a dynamic planning horizon, FH‑OPT, into a Model Predictive Controller (MPC) that optimizes replica counts over a rolling window. Together, these components form a closed‑loop proactive autoscaling design that adapts its lookahead based on measured provisioning delay. Evaluated across three policies (MPC+LSTM, MPC+Prophet, HPA) and six workload archetypes with five random seeds, MPC+LSTM achieves below 5% SLA violation on all workloads, compared with 7‑19% for reactive HPA and up to 28.7% for MPC+Prophet on bimodal traffic.

Authors:Qingyuan Yang, Dongyue Chen, Da Teng, Junhua Xiao, Jiaji Pan, Shizhuo Deng
Title: FRWKV+: Adaptive Periodic-Position Branch Interaction for Frequency-Space Linear Time Series Forecasting
Abstract:
Long‑term time series forecasting is essential for decision making in energy, finance, transportation, and healthcare systems. Recent lightweight forecasting models improve efficiency by operating in transformed or linearized spaces, but two challenges remain in frequency‑space forecasting. The real and imaginary streams of complex spectra contain complementary information that is often weakly exchanged, and periodic‑position cues can help recurring patterns only when they are reliable for the current dataset and prediction horizon. To address these challenges, we propose FRWKV+, an enhanced FRWKV forecasting model for selective periodic‑position branch interaction. FRWKV+ first introduces cross‑branch gates that exchange compact contexts between the real and imaginary frequency streams, allowing each stream to modulate the other. It then uses the Adaptive PhaseGate mechanism to extract periodic‑position context and generate signed corrections to these gates. An adaptive trust mechanism controls the correction strength at the sample, variable, and channel levels, so periodic‑position information is admitted as a reliable correction signal while preserving the efficiency of the FRWKV backbone. External benchmark tables report a separately labeled FRWKV‑family selected system for manuscript‑level comparison, while mechanism‑level claims are based on strict matched‑seed FRWKV‑family ablations and representative component‑level ablations. Under this matched protocol, FRWKV+ achieves the largest MSE winner coverage among the family variants and provides clear gains in selected periodic regimes. Component analysis further supports the usefulness of periodic‑position context, signed correction, and adaptive trust in these regimes, while revealing boundary cases where simpler correction rules remain preferable.

Authors:Jinuk Kim, Junsoo Byun, Donghwi Hwang, Seong-Jin Park, Hyun Oh Song
Title: Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation
Abstract:
Manufacturable chip layouts must satisfy thousands of geometry‑based design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural language rules into correct DRC scripts is labor‑intensive and requires specialized expertise, motivating LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often evaluate scripts by code similarity rather than execution correctness, and prior machine learning‑based methods either ignore execution feedback or require labeled test layouts as agent's input. To this end, we introduce Rule2DRC, a large‑scale benchmark for DRC script coding agents with 1,000 rule‑to‑script tasks and 13,921 evaluation chip layouts for execution‑based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring evaluation layouts as input to the agent. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best‑of‑N selection performance in this domain. We release the code at https://github.com/snu‑mllab/Rule2DRC.

Authors:Pranav Somu, Advay Balakrishnan, Stepan Kravtsov, Aaron McDaniel, Jason Zutty
Title: Towards Code-Oriented LM Embeddings for Surrogate-Assisted Neural Architecture Search
Abstract:
Developing effective surrogates (performance predictors) for Neural Architecture Search (NAS) typically requires expensive fine‑tuning or the engineering of complex representations. We propose a low‑cost embedding strategy that leverages the inductive bias of Language Models (LMs) to eliminate these overheads. By representing architectures as PyTorch class definition text, we demonstrate that off‑the‑shelf LMs act as competitive feature extractors without NAS‑specialized fine‑tuning. The final predictor is constructed by passing the extracted Code‑Oriented LM Embeddings (COLE) through a lightweight regression head. We also investigate strategies to improve embedding quality and utilization. Our experiments on the NAS‑Bench‑201 and einspace search spaces reveal that raw code inputs yield higher predictive performance than other text‑based encodings (e.g., ONNX‑to‑text encodings) when using frozen LMs. We also observe COLE drives superior surrogate‑assisted search using the BANANAS algorithm in NAS‑Bench‑201. When optimizing for CIFAR‑100 performance, replacing structural path encodings with COLE for architecture representation allows for a 34% decrease in the evaluation budget required to reach within 1% of the fittest architecture in the search space (by test accuracy). As any neural architecture can be represented as code, these findings establish COLE as a versatile and efficient foundation for advancing NAS.

Authors:Ali Abbasi, Chayne Thrash, Haoran Qin, Hamed Pirsiavash, Soheil Kolouri
Title: IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
Abstract:
Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource‑constrained and latency‑sensitive settings. SVD‑based post‑training compression offers a hardware‑agnostic way to reduce model size and improve inference efficiency through low‑rank factorization. However, existing methods often rely on input‑only whitening spaces, homogeneous rank allocation, or loss‑agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input‑Output Whitened SVD (IO‑SVD), a post‑training compression method that forms a KL‑aware double‑sided whitening space for model weights. Using a second‑order expansion of the KL loss over the top‑K token probabilities, IO‑SVD constructs an output‑side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank‑allocation strategy that scores whitened singular components using first‑order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work that combines SVD truncation with quantization, we improve hybrid SVD‑quantization compression through loss‑aware remapping, which selects low‑rank factor rows for 8‑bit quantization based on the predicted loss change incurred by quantizing them. Extensive experiments across diverse LLM and VLM families, and inference‑time analysis shows that IO‑SVD compresses LLMs with minimal performance degradation while delivering practical inference speedups. Code is available at https://github.com/mint‑vu/IO‑SVD.git

Authors:Hojun Chung, Junseo Lee, Songhwai Oh
Title: Offline Reinforcement Learning with Universal Horizon Models
Abstract:
Model‑based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on‑policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self‑generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite‑horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long‑horizon reasoning. Project page: https://rllab‑snu.github.io/projects/UHM/

Authors:Jiale Liu, Jungang Li, Jieming Yu, Xinglin Yu, Zihao Dongfang, Zongjian Ding, Kaifeng Ding, Yi Yang, Lidong Chen, Yang Zou, Shunwen Bai, Jiahuan Zhang, Haoran Huang, Shan Huang, Yudong Gao, Mingjun Cheng
Title: CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
Abstract:
Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry‑consistent panoramic training interface. Dense trajectories duplicate nearby views, source‑specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth‑inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB‑D‑pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage‑Oriented Viewpoint curation with ERP Range‑depth warping), a training‑free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage‑style approximation behavior up to an additive error term. Using COVER, we build CM‑EVS (Coverage‑curated Metric ERP View Set), a panoramic RGB‑D‑pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re‑encoded into the same schema. Each frame provides full‑sphere RGB, metric range depth, calibrated pose; COVER‑produced indoor frames include per‑step provenance logs. With a median of only 25 frames per indoor scene, CM‑EVS covers all 13 unified room types while maintaining compact scene‑level coverage. Experiments show that COVER improves the coverage‑conflict trade‑off, making CM‑EVS a sparse, compact, and auditable RGB‑D‑pose resource for geometry‑consistent panoramic 3D learning.

Authors:Minseo Kim, Huanghao Mai, Jay Shenoy, Alec Follmer, Gordon Wetzstein, Frederic Poitevin
Title: CrystalBoltz: End-to-End Protein Structure Determination via Experiment-Guided Diffusion for X-Ray Crystallography
Abstract:
Generative models trained on public databases of protein structures, most of which have been determined by X‑ray crystallography, now provide powerful priors for structure prediction. However, they are not readily conditioned on the measurements from a new crystallographic experiment, limiting their use for X‑ray structure determination. In crystallography, the measured structure‑factor amplitudes do not by themselves determine an electron density map or atomic structure because the associated phases are unobserved and must be inferred. Structure determination therefore remains an inverse problem in which candidate models must be both structurally plausible and consistent with measured diffraction data, often requiring substantial manual refinement by human experts. Emerging methods aim to incorporate experimental information more directly into predictive and refinement workflows. We present CrystalBoltz, a generative framework that casts crystallographic refinement as Bayesian inference over atomic structures and operates directly on structure‑factor amplitudes. CrystalBoltz moves from unguided generation with a pre‑trained prior over protein structures to experiment‑guided posterior sampling, followed by atomic coordinate and B‑factor refinement. Across multiple protein crystallography datasets, CrystalBoltz attains lower coordinate RMSD and lower R‑factors than the strongest baselines considered, while reducing runtime by a factor of 33 relative to existing experimentally guided refinement.

Authors:De Shuai Zhang
Title: When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation
Abstract:
Continuous diffusion and flow models are attractive for non‑autoregressive text generation because they can update all positions in parallel. A major difficulty is the interface between continuous latent states and discrete tokens. This report studies a draft‑conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. Early Gaussian‑start experiments showed that good latent‑space metrics, such as scale matching or cosine similarity, do not guarantee good decoding. Generated latents can be close to real encoder latents but still produce high‑entropy, biased, or repetitive token distributions. We therefore frame the task as controlled local refinement rather than full generation from noise. On ROCStories, using the first two sentences as prompt and the last three as target, full 768‑dimensional BERT latents recover tokens much better than compressed 256‑dimensional latents. With 768‑dimensional latents, DraftPrior target‑token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. Local flow refinement and fused decoder‑aware readout give modest additional gains, while metric learning and OT‑style alignment improve geometry but do not close the decoder gap. The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder‑readable structure.

Authors:Louisa Cornelis, Johan Mathe, Louis Van Langendonck, Guillermo Bernárdez, Nina Miolane
Title: OgBench: A Framework for Evaluating Graph Neural Networks on Omics Data
Abstract:
Graph Neural Networks (GNNs) have become the dominant framework for inductive graph‑level learning. Yet most benchmarks focus on the regime n \gg p, where the number of graphs n greatly exceeds the number of nodes per graph p. This overlooks biological domains such as omics, which operate in the opposite n \ll p regime, characterized by large graphs of genes, transcripts, or proteins across few patient samples. This raises the question: how do GNNs perform in this low‑sample, high‑node omics setting? We introduce \textttOgBench (Omics‑Graph Bench), the first benchmarking platform for graph‑level prediction in the n \ll p regime characteristic of omics data. We provide a standardized, end‑to‑end modular infrastructure from raw omics data to families of featured graphs with varied structural properties. We benchmark classical GNNs, as well as GNNs designed for large graphs and omics applications, alongside MLPs and machine learning baselines to establish reference performances. Our results show that widely used GNNs often do not outperform simple MLPs and classical baselines. These findings challenge the prevailing assumption that graph structure inherently adds value in this domain, fostering a critical reassessment of current learning paradigms. Ultimately, by exposing these limitations, OgBench provides the open‑source ecosystem necessary for the community to develop and validate novel architectures explicitly tailored for biological graphs. The code is available at https://github.com/geometric‑intelligence/ogbench.

Authors:Yijun Lu, Zilei Yang, Yuyin Ma
Title: parallelcbf: A composable safety-filter and auditability framework for tensor-parallel reinforcement learning
Abstract:
While Isaac Lab provides massive parallel UAV simulation, OmniSafe and safe‑control‑gym provide constrained‑RL benchmarks, and CBFKit provides control‑barrier‑function synthesis tooling, no existing framework unifies these capabilities for end‑to‑end safety‑constrained training. ParallelCBF is the first framework to unify (i)~tensor‑parallel UAV environments, (ii)~hard‑gate CBF safety filters, (iii)~sharded BC‑to‑RL pipelines, and (iv)~first‑class operational auditability ‑‑ pre‑registration, watchdog registries, failure forensics, and dataset audits as composable APIs rather than user‑implemented scripts. We release ParallelCBF v0.1.0 under Apache~2.0 with a four‑layer composable API, a CPU PyTorch reference implementation of a dual‑barrier (squared / linear‑predictive) CBF, property‑based safety invariance tests across vectorized batch sizes that complete in 1.67~s for the full 39‑test suite, and a 31,415‑episode behavior‑cloning collection campaign whose curriculum mix, per‑bucket yields, and dataset SHA‑256 are auditable through the framework's own \textttops primitives. We report a representative end‑to‑end pipeline execution in which the framework's auditability layer halted a downstream training stage that did not meet pre‑registered convergence criteria, preventing silent propagation of a degraded checkpoint ‑‑ an architectural property we argue is necessary, not merely useful, for reproducible empirical robotics research. The framework is installable via \textttpip install parallelcbf; source and release artifacts are available at https://github.com/xiaoyang‑123‑cell/ParallelCBF.

Authors:Shi-ang Qi, Vahid Balazadeh, Michael Cooper, Russell Greiner, Rahul G. Krishnan
Title: SurvivalPFN: Amortizing Survival Prediction via In-Context Bayesian Inference
Abstract:
Survival analysis provides a powerful statistical framework for modeling time‑to‑event outcomes in the presence of censoring. However, selecting an appropriate estimator from the many specialized survival approaches often requires substantial methodological and domain expertise. We introduce SurvivalPFN, a prior‑data fitted network that amortizes Bayesian inference for censored observations through in‑context learning. SurvivalPFN is pretrained on a diverse family of synthetic, identifiable, and right‑censored data‑generating processes, enabling it to amortize survival analysis in a single forward pass during inference. As a result, the model adapts to the effective complexity of each dataset without task‑specific training or hyperparameter tuning, avoids restrictive parametric assumptions, and produces calibrated survival distributions. In a large‑scale benchmark spanning 61 datasets, 21 methods, and 5 evaluation metrics, SurvivalPFN achieves strong predictive performance and often improves upon established survival models. These results suggest that SurvivalPFN offers a principled and practical foundation model for survival analysis, with potential applications in high‑impact domains such as healthcare, finance, and engineering (https://github.com/rgklab/SurvivalPFN).

Authors:Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin
Title: When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing
Abstract:
Mixture‑of‑Experts (MoE) networks promise favorable accuracy‑compute trade‑offs, yet practical vision deployments are hindered by expert collapse and limited end‑to‑end efficiency gains. We study when sparse top‑k routing with hard capacity constraints helps in vision classification, evaluated under multi‑seed protocols on four benchmarks (CIFAR‑10/100, Tiny‑ImageNet, ImageNet‑1K). We observe a \emphcompute‑leverage pattern: positive accuracy gaps require a substantial fraction ρ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi‑expert routing (k \geq 2) is additionally required. Two controlled experiments isolate these factors. A hidden‑size sweep on CIFAR‑10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet‑1K ablation that varies only top‑k ‑‑ holding architecture, initialization, and ρ fixed ‑‑ reverses the gap from positive to negative across all five seeds. A per‑sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR‑100 above the dense baseline, identifying batch‑axis dispatch as the dominant failure mode in per‑sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse‑moe‑vision‑rho.

Authors:Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi
Title: GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
Abstract:
Post‑training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL‑based post‑training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier‑backed environments. The latter has dominated recent reasoning‑oriented post‑training because it delivers stronger gains and higher efficiency on domain‑specific tasks (e.g., reasoning). However, although in‑domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open‑ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on Qwen3‑4B‑Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in‑domain RLVR baseline. The resulting model is even competitive with Qwen's released post‑trained models which required a much larger training cost. Notably, a subsequent in‑domain RLVR stage brings only selective gains, mainly on harder competition‑math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post‑trained models. Our code and data will be available at: \hrefhttps://github.com/SJY8460/GRLOhttps://github.com/SJY8460/GRLO.

Authors:Arsha Nagrani, Jasper Uijilings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A Ross, Cordelia Schmid
Title: Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Abstract:
Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva‑Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high‑quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi‑step multimodal questions and spatiotemporally‑dense human‑annotated reasoning traces. Benchmarking experiments show that state‑of‑the‑art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva‑Ego can be downloaded at https://github.com/google‑deepmind/neptune.

Authors:Jiachen Jiang, Huminhao Zhu, Zhihui Zhu
Title: SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution
Abstract:
LLM‑driven program evolution has emerged as a powerful tool for automated scientific discovery, yet existing frameworks offer no principled guide for designing their individual components and provide no guarantee that the search converges. We introduce SMCEvolve, which recasts program search as sampling from a reward‑tilted target distribution and approximates it with a Sequential Monte Carlo (SMC) sampler. From this view, three core mechanisms emerge as principled components: adaptive parent resampling, mixture of mutation with acceptance, and automatic convergence control. We further provide a finite‑sample complexity analysis that bounds the LLM‑call budget required to reach a target approximation error. Across math, algorithm efficiency, symbolic regression, and end‑to‑end ML research benchmarks, SMCEvolve surpasses state‑of‑the‑art evolving systems while using fewer LLM calls under self‑determined termination. The code is available at https://github.com/kongwanbianjinyu/SMCEvolve.

Authors:Gideon Popoola, John Sheppard
Title: GESD: Beyond Outcome-Oriented Fairness
Abstract:
Machine learning (ML) algorithms are increasingly deployed in high‑stakes decision‑making domains such as loan approvals, hiring, and recidivism predictions. While existing fairness metrics (e.g., statistical parity, equal opportunity) effectively quantify outcome‑oriented disparities, they offer limited insight into the procedure or explanation behind biased decisions. To address this gap, we propose Group‑level Explanation Stability Disparity (GESD), a procedural‑oriented fairness metric that measures disparities in the stability, robustness, and sensitivity of model explanations across different subgroups in a protected category. %GESD is explainer‑agnostic, model‑agnostic, and extends the scope of fairness analyses to the level of explainability. We further integrate GESD into a multi‑objective optimization framework that jointly optimizes for utility, outcome‑based fairness, and explanation‑based fairness called FEU (Fairness‑‑Explainability‑‑Utility). Empirical results on multiple benchmark datasets show that GESD effectively captures group‑wise discrepancies in explanation quality, and that FEU improves both utility and fairness over state‑of‑the‑art methods. By bridging outcome‑based and explanation‑based fairness, GESD offers a comprehensive tool for diagnosing and mitigating bias in predictive modeling. Our code and datasets are available on GitHub \hyperlinkhttps://github.com/horlahsunbo/GESDhttps://github.com/horlahsunbo/GESD

Authors:Qiang Liu, Felix Koehler, Benjamin Holzschuh, Nils Thuerey
Title: Tadpole: Autoencoders as Foundation Models for 3D PDEs with Online Learning
Abstract:
We introduce Tadpole, a novel foundation model for three‑dimensional partial differential equations (PDEs) that addresses key challenges in transferability, scalability to high dimensionality, and multi‑functionality. Tadpole is pre‑trained as an autoencoder on synthetic 3D PDE data generated by an efficient online data‑generation framework. This enables large‑scale, diverse training without storage or I/O overhead, demonstrated by scaling to an equivalent of hundreds of terabytes of training data. By autoencoding single‑channel spatial crops, Tadpole learns rich and transferable representations across heterogeneous physical systems with varying numbers of state variables and spatial resolutions. Although pre‑trained solely as an autoencoder, Tadpole can be efficiently applied for multiple downstream tasks beyond reconstruction, including dynamics learning and generative modeling. For dynamics learning, we propose a novel parameter‑efficient fine‑tuning strategy that integrates low‑rank adaptation, latent‑space transformations, and reintroduced skip connections, achieving accurate temporal modeling with a minimal number of trainable parameters. Tadpole demonstrates strong fine‑tuning performance across various downstream tasks, highlighting its versatility and effectiveness as a foundation model for 3D PDE learning. Source code and pre‑trained weights of Tadpole are available at https://github.com/tum‑pbs/tadpole

Authors:Xujia Chen, Xinyue Hu, Letian Chen, Daming Shi, Wenhui Fan
Title: Curriculum Learning of Physics-Informed Neural Networks based on Spatial Correlation
Abstract:
Physics‑Informed Neural Networks (PINNs) combine deep learning with physical constraints for solving partial differential equations (PDEs), and are widely applied in fluid mechanics, heat transfer, and solid mechanics. However, PINN training still suffers from high‑dimensional non‑convex loss landscapes, imbalanced multiobjective constraints, and ineffective information propagation. Existing curriculum learning and causality‑guided strategies improve training stability, but mainly focus on temporal or parametric progression, lacking explicit treatment of spatial information propagation and inter‑region consistency. Moreover, they are not directly applicable to boundary value problems (BVPs) with strong spatial coupling. To address this issue, we propose a spatially correlated curriculum learning framework for PINNs. To the best of our knowledge, this is the first work to address PINN training difficulties from the perspective of spatial coupling among subregions. First, spatial causal weights guide information from near‑boundary regions inward, reducing optimization failures and spurious convergence. Second, a low‑frequency information bridge enforces pseudo‑label‑based consistency across spatially separated regions, suppressing global low‑frequency drift. Third, a region‑adaptive reweighting strategy adjusts subregion losses to reduce local residuals and recover high‑frequency details. Experiments on PDE benchmarks show that, under comparable computational cost, the proposed method alleviates training failures and improves solution accuracy. The code is available at https://github.com/pigofmomo/CurriculumLearningPINN.

Authors:Fanxu Meng
Title: GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Abstract:
Multi‑head Latent Attention (MLA), the attention used in DeepSeek‑V2/V3, jointly compresses keys and values into a low‑rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path ‑ an absorbed MQA form ‑ which ties efficient inference to H100‑class compute‑bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi‑Token Prediction (MTP) gain on commodity inference GPUs such as the export‑restricted H20. We propose Group‑Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA‑absorb path identical to MLA's, and a GQA path with a per‑group expanded cache. The runtime picks the path that matches the target hardware ‑ no retraining, no custom kernels ‑ so a single set of GQLA weights pins the rooflines of both H100 (MQA‑absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8‑way zero‑redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA‑3‑8B it compresses the per‑token KV cache to 28.125% of the GQA baseline on the MQA‑absorb path while structurally preserving GQA‑level traffic on the per‑group path.

Authors:Yi Xie, Siao Liu, Falong Fan, Yuanqi Yao, Yue Zhao, Bo Liu
Title: TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination
Abstract:
Multi‑agent LLM systems have shown promise for complex reasoning, yet recent evaluations reveal they often underperform single‑model baselines. We identify a structural failure mode in sequential fine‑tuning of shared‑context teams: updating one agent shifts the team's context distribution, and when subsequent updates are evaluated on cached rollouts, this mismatch compounds. We formalize this as the compounding occupancy shift and prove that stale‑occupancy evaluation incurs a penalty that scales quadratically with the number of agents. In contrast, intermediate‑occupancy evaluation reduces this to linear scaling. We propose TeamTR, a trust‑region framework that resamples trajectories after each component update and enforces per‑agent divergence control, yielding rigorous per‑update and per‑stage improvement lower bounds. Experiments show that TeamTR outperforms single‑agent and sequential baselines with 7.1% on average, mitigates coordination regressions, and supports plug‑and‑play component replacement. Code is available at https://github.com/Yydc/TeamTR.

Authors:Dzung Pham, Kleomenis Katevas, Ali Shahin Shamsabadi, Hamed Haddadi
Title: AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
Abstract:
Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi‑step tasks such as coding or web‑based question answering. While remote, cloud‑based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage‑based fees. However, agentic workflows are far more resource‑intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM‑based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single‑inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low‑cost execution signals, such as token‑level log probabilities, AgentStop can reduce wasted energy by 15‑20% with minimal impact on task performance (<5% utility drop) for challenging web‑based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy‑preserving LLM agents on user devices. Our project code and data are available at https://github.com/brave‑experiments/AgentStop.

Authors:ML Nissen Gonzalez, Melwina Albuquerque, Laurence Wroe, Jacob Meyer Cohen, Logan Riggs Smith, Thomas Dooms
Title: When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
Abstract:
Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out‑of‑distribution mechanisms, or basis‑dependent parameters, meaning they disregard weight‑space symmetries. To address these issues for the class of tensor‑based models, we introduce a weight‑based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross‑layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one of empirical approximation.

Authors:Chenyu Lian, Hong-Yu Zhou, Jing Qin
Title: Evidential Reasoning Advances Interpretable Real-World Disease Screening
Abstract:
Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region‑level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence‑aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post‑hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real‑world disease screening, yielding notably higher specificity at clinical‑level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.

Authors:Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh, Anwai Archit, Surbhi Motghare, Christina Conrad Parry, Luis Oala, Lara Grosso, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Eric S. Rosenthal, Marzyeh Ghassemi, Matthew McDermott, Tom Pollard
Title: Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets
Abstract:
Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON‑LD‑based format that makes dataset discovery, automated ingestion, and reproducible analysis machine‑checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high‑value data ML increasingly relies on. We release Croissant Baker, a local‑first, open‑source command‑line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC‑IV at 886 million rows and 374 Parquet files. On held‑out comparisons against producer‑authored or standards‑derived ground truth, Croissant Baker reaches 97‑100% agreement across multiple domains.

Authors:Thomas Witt
Title: XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference
Abstract:
We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per‑channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed‑expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically ‑‑ no Hessian, no calibration data, no manual bit‑width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub‑byte index tensor into a per‑group learned codebook. Two storage modes share one auto‑select frontend and one fused decode kernel: V2 (per‑channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5‑122B‑A10B under V2, XFP reaches 138 tok/s single‑stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict‑match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H‑Process: a quality‑driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator‑set thresholds, an OOM boundary at quantize‑on‑load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5‑397B‑A17B (512 routed experts/layer), the H‑Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long‑output decode at 66.72% GSM8K strict‑match on the full 1319‑problem set (single seed at submission; multi‑seed evaluation in progress), exceeding INT4 with routed‑expert pruning on memory, throughput, and accuracy simultaneously.

Authors:Yan Jiang, Ruihong Qiu, Zi Huang
Title: GFMate: Empowering Graph Foundation Models with Test-time Prompt Tuning
Abstract:
Graph prompt tuning has shown great potential in graph learning by introducing trainable prompts to enhance the model performance in conventional single‑domain scenarios. Recent research has extended graph prompts to improve Graph Foundation Models (GFMs) by few‑shot tuning auxiliary prompts. Despite their progress, most existing methods embed source‑domain information into prompts, which serve either as input to GFMs or encoded during model pre‑training. Such prompt entanglement with specific source domains and GFM pre‑training strategy restricts their generalisability to other domains and different GFMs. Furthermore, existing GFM prompts merely rely on few‑shot tuning for adaptation, neglecting the rich information in unlabelled target domain test data. Motivated by these insights, this paper aims to empower GFMs with pre‑training‑agnostic test‑time graph prompt tuning, named GFMate. GFMate introduces centroid and layer prompts applied after pre‑training on target domains, avoiding entanglement with specific source domains and model pre‑training. In addition, a test‑time complementary learning objective is devised to exploit both labelled and unlabelled target domain data for effective test‑time prompt tuning. Extensive experiments on 12 benchmark datasets demonstrate the superior performance and efficiency of GFMate, achieving improvements of up to 30.63%. Code is available at https://github.com/YanJiangJerry/GFMate.

Authors:William Lugoloobi, Samuelle Marro, Jabez Magomere, Joss Wright, Chris Russell
Title: Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces
Abstract:
As LLM‑based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent's actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance, but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces \hrefhttps://github.com/KabakaWilliam/known_actionshere.

Authors:Byeongchan Kim, Min-hwan Oh
Title: Peng's Q($λ$) for Conservative Value Estimation in Offline Reinforcement Learning
Abstract:
We propose a model‑free offline multi‑step reinforcement learning (RL) algorithm, Conservative Peng's Q(λ) (CPQL). Our algorithm adapts the Peng's Q(λ) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi‑step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over‑pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near‑optimal performance guarantees ‑‑ a milestone that previous conservative approaches could not achieve. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single‑step baselines. In addition to the contributions of CPQL in offline RL, our proposed method also contributes to the offline‑to‑online learning framework. Using the Q‑function pre‑trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine‑tuning and to attain robust performance improvements. Our code is available at https://github.com/oh‑lab/CPQL.

Authors:Qirui Liu, Hao Chen, Weijie Shi, Jiajie Xu, Jia Zhu
Title: Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions
Abstract:
Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with long‑tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) deployment parado‑large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge, while small models overfit to noise. Unlike traditional methods that increase diversity through large‑scale data synthesis, we propose a two‑stage knowledge distillation framework that mines high‑value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual‑layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design difficulty‑adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter‑class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP‑Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross‑topic tests of middle school algebra misconception benchmarks, significantly outperforming sota LLM (67.73%) and standard fine‑tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.

Authors:Davide Scassola, Andrea Coser, Sebastiano Saccani
Title: ReMIA: a Powerful and Efficient Alternative to Membership Inference Attacks against Synthetic Data Generators
Abstract:
Tabular data sharing under privacy constraints is increasingly important for research and collaboration. Synthetic data generators (SDGs) are a promising solution, but synthetic data remains vulnerable to attacks, such as membership inference attacks (MIAs), which aim to determine whether a specific record was part of the training data. State‑of‑the‑art MIAs are powerful but impractical: they rely on shadow modeling, requiring hundreds of SDG training runs, and need auxiliary data several times larger than the original training set. Fast proxy metrics like distance to closest record (DCR) are efficient but have limited sensitivity to MIA risk. We introduce ReMIA (Relative Membership Inference Attack), a practical privacy metric that requires only two SDG training runs and additional data no larger than the original training set. Rather than predicting whether a record was in the training set, ReMIA generates two synthetic datasets from two source datasets and measures whether a classifier can identify which source a record came from. Experiments across multiple tabular datasets and SDGs show that ReMIA has a sensitivity comparable to state‑of‑the‑art MIAs while being substantially more practical. We further observe that SDGs can achieve privacy‑utility trade‑offs that traditional noise‑based anonymization methods do not match. Code is available at https://github.com/aindo‑com/remia.

Authors:Nabil Iqbal, T. Anderson Keller, Yue Song, Takeru Miyato, Max Welling
Title: Spontaneous symmetry breaking and Goldstone modes for deep information propagation
Abstract:
In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone‑like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long‑term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long‑sequence modeling tasks.

Authors:Jaemin Seo, Surin Lee, Jae Yong Lee
Title: Unbiased and Second-Order-Free Training for High-Dimensional PDEs
Abstract:
Deep learning methods based on backward stochastic differential equations (BSDEs) have emerged as competitive alternatives to physics‑informed neural networks (PINNs) for solving high‑dimensional partial differential equations (PDEs). By leveraging probabilistic representations, BSDE approaches can avoid the curse of dimensionality and often admit second‑order‑free training objectives that do not require explicit Hessian evaluations. It has recently been established that the commonly used Euler‑Maruyama (EM) time discretization induces an intrinsic bias in BSDE training losses. While high‑order schemes such as Heun can fully eliminate this bias, such schemes re‑introduce second‑order spatial derivatives and incur substantial computational overhead. In this work, we provide a principled analysis of EM‑induced loss bias and propose an unbiased, second‑order‑free training framework that preserves the computational advantages of BSDE methods. Our code is available at https://github.com/seojaemin22/Un‑EM‑BSDE.

Authors:Tianfang Zhu, Ning An, Rui Wang, Jiasi Gao, Qingming Luo, Anan Li, Guyue Zhou
Title: Let Robots Feel Your Touch: Visuo-Tactile Cortical Alignment for Embodied Mirror Resonance
Abstract:
Observing touch on another's body can elicit corresponding tactile sensations in the observer, a phenomenon termed mirror touch that supports empathy and social perception. This visuo‑tactile resonance is thought to rely on structural correspondence between visual and somatosensory cortices, yet robotic systems lack computational frameworks that instantiate this principle. Here we demonstrate that cortical correspondence can be operationalized to endow robots with mirror touch. We introduce Mirror Touch Net, which imposes semantic, distributional and geometric alignment between visual and tactile representations through multi‑level constraints, enabling prediction of millimetre‑scale tactile signals across 1,140 taxels on a robotic hand from RGB images. Manifold analysis reveals that these constraints reshape visual representations into geometry consistent with the tactile manifold, reducing the complexity of cross‑modal mapping. Extending this alignment framework to cross‑domain observations of human hands enables tactile prediction and reflexive responses to observed human touch. Our results link a neural principle of visuo‑tactile resonance to robotic perception, providing an explainable route towards anticipatory touch and empathic human‑robot interaction. Code is available at https://github.com/fun0515/Mirror‑Touch‑Net.

Authors:Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal
Title: Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
Abstract:
Context. Behaviour‑Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within‑file Background, within‑repo reusable‑scenario invocation, cross‑organisational shared higher‑level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction‑worthy), pre‑map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L‑step window (L in [2, 18]) in a 339‑repository / 276‑upstream‑owner Gherkin corpus is keyed by paraphrase‑robust cluster identifiers and counted under three scopes. Sentence‑BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density‑Based Clustering (HDBSCAN) recovers paraphrase‑equivalent slices. Three authors label a stratified 200‑slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction‑worthy classifier trained under 5‑fold cross‑validation is compared with a tuned rule baseline and two open‑weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three‑author Fleiss' kappa = 0.56 (extraction‑worthy) and 0.79 (mechanism). The classifier reaches out‑of‑fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e‑4). 75.0%, 59.5%, and 11.7% of scenarios carry a within‑file Background, within‑repo reusable‑scenario, or cross‑organisational shared‑step candidate. Conclusion. Paraphrase‑robust subscenario discovery yields a corpus‑wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache‑2.0.

Authors:Hao Li, Lu Zhang, Liu Chong, Yankai Chen, Pengyang Wang, Yingjie Zhou
Title: SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies
Abstract:
Instance normalization (IN) is widely used in non‑stationary multivariate time series forecasting to reduce distribution shifts and highlight common patterns across samples. However, IN can over‑smooth instance‑specific structural information that is essential for modeling temporal and cross‑channel heterogeneity. While prior methods further suppress distribution discrepancies or attempt to recover temporal specific dependencies, they often ignore a central tension: how to adaptively model common and instance‑specific dependency based on each instance's non‑stationary structures. To address this dilemma, we propose SeesawNet, a unified architecture that dynamically balances common and instance‑specific dependency modeling in both temporal and channel dimensions. At its core is Adaptive Stationary‑Nonstationary Attention (ASNA), which captures common dependencies from normalized sequences and specific dependencies from raw sequences, and adaptively fuses them according to instance‑level non‑stationarity. Built upon ASNA, SeesawNet alternates dedicated temporal and channel relationship modeling to jointly capture long‑range and cross‑variable dependencies. Extensive experiments on multiple real‑world benchmarks demonstrate that SeesawNet consistently outperforms state‑of‑the‑art methods.

Authors:Shuqi Gu, Yongxiang Zhao, Baoyu Jing, Kan Ren
Title: What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
Abstract:
Time series forecasting has become increasingly critical in real‑world scenarios, where future sequences are influenced not only by historical patterns but also by forthcoming events. In this context, forecasting must dynamically adapt to complex and stochastic future conditions, which introduces fundamental challenges in both forecasting and evaluation. Traditional methods typically rely on historical data or factual future conditions, while overlooking counterfactual scenarios. Furthermore, many existing approaches are restricted to simple structured conditions, limiting their ability to generalize to the real‑world complexities. To address these gaps, we introduce the task of counterfactual time series forecasting with textual conditions, enabling more flexible and condition‑aware forecasting. We propose a comprehensive evaluation framework that encompasses both factual and counterfactual settings, even in the absence of ground truth time series. Additionally, we present a novel text‑attribution mechanism that distinguishes mutable from immutable factors, thereby improving forecast accuracy under sophisticated and stochastic textual conditions. The project page is at https://seqml.github.io/TADiff/

Authors:Qazi Mamunur Rashid, Xuan Yang, Zhengzhe Yang, Yanzhou Pan, Erin van Liemt, Darlene Neal, Kshitij Pancholi, Jamila Smith-Loud
Title: NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
Abstract:
Recent advancements in generative AI facilitate large‑scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence‑grounded methodology that generates socially relevant synthetic queries by leveraging a fine‑tuned taxonomy generator (TaG) anchored in real‑world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human‑authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama‑Guard‑3). We open‑source our end‑to‑end research prototype and datasets to enable scalable, high‑stakes model evaluation and targeted safety interventions (https://github.com/google‑research/nodesynth).

Authors:Zhengjia Zhong, Shuyan Ke, Zaizhou Lin, Jiaqi Song, Hongyi Lan, Hui Li
Title: RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression
Abstract:
Vector quantization is a fundamental tool for compressing high‑dimensional embeddings, yet existing multi‑codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ‑MoE), a framework combining a two‑level MoE with dual‑stream quantization to enable input‑dependent codebook adaptation for efficient vector quantization. RQ‑MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ‑MoE, and derive a guideline for setting expert dimensionality in RQ‑MoE. Extensive experiments show that RQ‑MoE achieves state‑of‑the‑art or on‑par performance in reconstruction and retrieval, while providing 6x‑14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ‑MoE.

Authors:Jessica Rumbelow
Title: Exemplar Partitioning for Mechanistic Interpretability
Abstract:
We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with ~ 10^3× fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader‑clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head‑to‑head benchmark. In Gemma‑2‑2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction‑tuned Gemma concentrates in a region whose exemplar ablation can collapse held‑out refusal. Cross‑checkpoint matching between base and instruction‑tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: ~ 20% of EP regions match an SAE feature at F_1 > 0.5, and EP one‑hot probes retain ~ 97% of raw‑activation probe accuracy at \ell_0 = 1. Nearest‑exemplar distance provides a free out‑of‑distribution signal at inference. On AxBench latent concept detection at Gemma‑2‑2B‑it L20, EP at p_1 reaches mean AUROC 0.881, +0.126 over the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE‑A's 0.911, at ~ 10^3× less build compute.

Authors:Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin
Title: Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor
Abstract:
KV‑cache compression at small budgets is a crowded design space spanning cache representation, head‑wise routing, compression cadence, decoding behavior, and within‑budget scoring. We study seven mechanisms across these five families under matched mean cache on long‑form mathematical reasoning (MATH‑500~\citehendrycks2021math) with two distilled‑reasoning models (Qwen‑7B and Llama‑8B variants of DeepSeek‑R1‑Distill~\citedeepseek2025r1) at budgets b \in \64, 128\. All seven were rejected. We then propose α, a one‑function modification to the TriAttention~\citemao2026triattention retention scorer that replaces argmax‑top‑k with greedy facility‑location‑inspired selection under a V‑space redundancy penalty controlled by a single weight λ. A pre‑registered protocol tunes λ on a frozen development split and confirms on a disjoint held‑out split; with λ= 0.5, α clears Bonferroni on two of the four (model, budget) cells (Qwen b=128 and Llama b=64), no cell is significantly negative, and the pre‑registered Branch~A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched‑memory, sympy‑graded, held‑out confirmation protocol is the evidence standard that made the asymmetry visible.

Authors:Weisen Jiang, Shuhao Chen, Sinno Jialin Pan
Title: MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
Abstract:
Mixture‑of‑Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy‑preserving framework that unifies independently trained, domain‑specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity‑aware proxy selection, which selects client‑domain‑relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context‑aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy‑preserving MoE unification methods. Code is available at https://github.com/ws‑jiang/MetaMoE.

Authors:Kai Sun, Peibo Duan, Yongsheng Huang, Guowei Zhang, Benjamin Smith, Nanxu Gong, Levin Kuhlmann
Title: Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks
Abstract:
Spiking neural networks (SNNs), which are brain‑inspired and spike‑driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is commonly adopted to improve SNN performance, but existing methods typically enforce uniform alignment across all timesteps, either from a teacher network or through inter‑temporal self‑distillation, implicitly assuming that per‑timestep predictions should be treated equally. In practice, SNN predictions vary and evolve over time, and intermediate timesteps need not all be individually correct even when the final aggregated output is correct. Under such conditions, effective distillation should not force every timestep toward the same supervision target, but instead provide corrective guidance to erroneous timesteps while preserving useful temporal dynamics. To address this issue, we propose Selective Alignment Knowledge Distillation (SeAl‑KD), which selectively aligns class‑level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter‑timestep similarity. Extensive experiments on static image and neuromorphic event‑based datasets demonstrate consistent improvements over existing distillation methods. The code is available at https://github.com/KaiSUN1/SeAl

Authors:Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani
Title: AudioMosaic: Contrastive Masked Audio Representation Learning
Abstract:
Audio self‑supervised learning (SSL) aims to learn general‑purpose representations from large‑scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre‑training. We introduce AudioMosaic, a contrastive learning‑based audio encoder for general audio understanding. During pre‑training, AudioMosaic constructs positive pairs by applying structured time‑frequency masking to spectrogram patches, which reduces memory usage and enables efficient large‑batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance‑level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state‑of‑the‑art performance on several standard audio benchmarks under both linear probing and fine‑tuning. We further show that integrating the pretrained AudioMosaic encoder into audio‑language models improves performance on audio‑language tasks. The code is publicly available in our \hrefhttps://github.com/HanxunH/AudioMosaicGitHub repository.

Authors:Sanghyeob Song, Donghyeok Lee, Jinsik Kim, Sungroh Yoon
Title: R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
Abstract:
For reinforcement learning in data‑scarce domains like real‑world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation‑level instability in Self‑Predictive Learning (SPL) under high Update‑to‑Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero‑centering conflicts with SPL's spectral properties and design a non‑centered objective accordingly. We verify R2R2 on SPL‑native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state‑of‑the‑art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2‑SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ~22% and provides additional gains on top of SimbaV2‑SPL, which itself establishes a new state‑of‑the‑art. The code can be found at: https://github.com/songsang7/R2R2

Authors:Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih
Title: Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
Abstract:
Autonomous language‑model agents are increasingly evaluated on long‑horizon tool‑use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider‑Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial‑and‑error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation‑and‑selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand‑written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist‑in‑the‑loop solution.

Authors:Jiaqi Liu, Xinyu Ye, Peng Xia, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao
Title: EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents
Abstract:
Long‑term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer‑generation policies remain frozen at deployment. We argue that truly adaptive memory requires co‑evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self‑evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM‑powered diagnosis module. In each evolution round, the module reads per‑question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta‑analyzer applies them with automatic revert‑on‑regression and explore‑on‑stagnation safeguards. This closed‑loop self‑evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self‑evolution process captures universal retrieval principles rather than benchmark‑specific heuristics. Code is available at https://github.com/aiming‑lab/SimpleMem.

Authors:Stuart Bladon, Brinnae Bent
Title: Feature Visualization Recovers Known Cortical Selectivity from TRIBE v2
Abstract:
Brain encoder models predict cortical fMRI responses from the internal activations of pretrained vision and language networks, and are typically evaluated by held‑out prediction accuracy. This is a useful signal for training but a poor one for interpretation: it tells us an encoder fits the data without telling us whether it has internalized the functional organization of the brain. We propose feature visualization ‑‑ gradient ascent on the encoder's predicted activation for a target region of interest (ROI) ‑‑ as a complementary interpretability technique, and apply it to TRIBE v2 composed with V‑JEPA 2 (ViT‑G, 40 layers), holding both frozen and synthesizing still images for seven regions spanning the ventral and dorsal visual hierarchies. Under identical hyperparameters, the probe recovers a visible progression of increasing spatial scale and feature complexity across V1 to V4, matching the ventral‑stream hierarchy. It also produces three distinctive downstream regimes: radial "frozen‑motion" streaks for the middle temporal area (MT) despite static‑only optimization, face‑like features for the fusiform face area (FFA), and consistent rectilinear line patterns for the parahippocampal place area (PPA). Optimized FFA stimuli drive the predicted region ~4x as much as a natural face photograph, consistent with feature visualization producing adversarial super‑stimuli rather than canonical exemplars. The probe is simple, differentiable, and applicable to any brain encoder with a differentiable backbone, allowing for qualitative evaluation of brain encoders.

Authors:Hassan Keshvarikhojasteh, Josien P. W. Pluim, Mitko Veta
Title: Attention-Based Multimodal Survival Prediction with Cross-Modal Bilinear Fusion
Abstract:
We propose a novel multimodal deep learning framework for patient‑level survival prediction, which integrates whole‑slide histology features, RNA‑seq expression profiles, and clinical variables. Our architecture combines an ABMIL module~\citeilse2018attention for slide‑level representation with feedforward encoders for RNA and clinical data. These embeddings are then integrated through low‑rank bilinear cross‑modal fusion~\citeliu2018efficient to model conditional interactions across modalities while controlling parameter growth. The model outputs continuous risk scores that are subsequently mapped to survival times using a nonparametric calibration procedure based on the Kaplan‑‑Meier estimator~\citekaplan1958nonparametric. By decomposing multimodal reasoning into independent pairwise interactions, the proposed fusion design promotes structural interpretability and parameter efficiency compared with full tensor and hierarchical fusion strategies. Experiments on the CHIMERA challenge dataset demonstrate improved predictive performance over concatenation‑based baselines and competitive generalization on hidden evaluation cohorts. These results indicate that the proposed framework is a promising approach for multimodal survival prediction in HR‑NMIBC. The implementation is publicly available at https://github.com/hassancpu/ChimeraChallenge2025_Task_3.

Authors:Dongxia Liu, Jie Ma, Xiaochen Yang, Jiancheng Zhang, Bin Xia, Zhehan Kan, Nisha Huang, Jun Liang, Wenming Yang, Jin Li
Title: MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation
Abstract:
The creation of cinematic‑quality animal effects necessitates the precise modeling of muscle and fur dynamics, a process that remains both labor‑intensive and computationally expensive within traditional production workflows. While generative diffusion models have shown promise in diverse artistic workflows, their capacity for high‑fidelity animal simulation remains largely unexploited. We present MoZoo, a generative dynamics solver that bypasses conventional refinement to synthesize high‑fidelity animal videos from coarse meshes under multimodal guidance. We propose Role‑Aware RoPE (RAR‑RoPE) which employs role‑based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, effectively preventing feature interference and improving computational efficiency. To address the scarcity of high‑quality training data, we introduce MoZoo‑Data, a synthetic‑to‑real pipeline that leverages a rendering engine and an inverse mapping approach to construct a large‑scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark with 120 mesh‑video pairs. Experimental results demonstrate that MoZoo achieves high‑fidelity fur simulation across diverse animal skeletons and layouts, preserving superior temporal and structural consistency.

Authors:Ido Sobol, Kihyuk Sohn, Yoav Blum, Egor Zakharov, Max Bluvstein, Andrea Vedaldi, Or Litany
Title: Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
Abstract:
We often aim to generate images that are both photorealistic and 3D‑consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine‑tuning an image generator, pre‑trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models, that decouples controls and visual domain. The key idea is to explicitly learn visual domain, real or synthetic, separately from other control signals by introducing a co‑variate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability, without fitting to specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about roles of different layers and denoising steps in diffusion‑based generators, informing new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks as text‑to‑multiview generation and texturing from 3D inputs, producing outputs that are 3D‑consistent and photorealistic.

Authors:Zijie Wu, Lixin Xu, Puhua Jiang, Sicong Liu, Chunchao Guo, Xiang Bai
Title: R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
Abstract:
Video‑guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real‑world scenarios, the initial pose of a user‑provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. To address this, we present Rectified Dynamic Mesh (R‑DMesh), a unified framework designed to generate high‑fidelity 4D meshes that are ``rectified'' to align with video context. Unlike standard motion transfer approaches, our method introduces a novel VAE that explicitly disentangles the input into a conditional base mesh, relative motion trajectories, and a crucial rectification jump offset. This offset is learned to automatically transform the arbitrary pose of the input mesh to match the video's initial state before animation begins. We process these components via a Triflow Attention mechanism, which leverages vertex‑wise geometric features to modulate the three orthogonal flows, ensuring physical consistency and local rigidity during the rectification and animation process. For generation, we employ a Rectified Flow‑based Diffusion Transformer conditioned on pre‑trained video latents, effectively transferring rich spatio‑temporal priors to the 3D domain. To support this task, we construct Video‑RDMesh, a large‑scale dataset of over 500k dynamic mesh sequences specifically curated to simulate pose misalignment. Extensive experiments demonstrate that R‑DMesh not only solves the alignment problem but also enables robust downstream applications, including pose retargeting and holistic 4D generation.

Authors:Dongzhe Zheng, Tao Zhong, Christine Allen-Blanchette
Title: Topology-Preserving Neural Operator Learning via Hodge Decomposition
Abstract:
In this paper, we study solution operators of physical field equations on geometric meshes from a function‑space perspective. We reveal that Hodge orthogonality fundamentally resolves spectral interference by isolating unlearnable topological degrees of freedom from learnable geometric dynamics, enabling an additive approximation confined to structure‑preserving subspaces. Building on Hodge theory and operator splitting, we derive a principled operator‑level decomposition. The result is a Hybrid Eulerian‑Lagrangian architecture with an algebraic‑level inductive bias we call Hodge Spectral Duality (HSD). In our framework, we use discrete differential forms to capture topology‑dominated components and an orthogonal auxiliary ambient space to represent complex local dynamics. Our method achieves superior accuracy and efficiency on geometric graphs with enhanced fidelity to physical invariants. Our code is available at https://github.com/ContinuumCoder/Hodge‑Spectral‑Duality

Authors:Jascha Wanger
Title: VectorSmuggle: Steganographic Exfiltration in Embedding Stores and a Cryptographic Provenance Defense
Abstract:
Modern retrieval‑augmented generation (RAG) systems convert sensitive content into high‑dimensional embeddings and store them in vector databases that treat the resulting numerical artifacts as opaque. Major vector‑store products do not provide native controls for embedding integrity, ingestion‑time distributional anomaly detection, or cryptographic provenance attestation. We show this opens a class of steganographic exfiltration attacks: an attacker with write access to the ingestion pipeline can hide payload data inside embeddings using simple post‑embedding perturbations (noise injection, rotation, scaling, offset, fragmentation, and combinations thereof) while preserving the surface‑level retrieval behavior the RAG system exposes to legitimate users. We evaluate these techniques across a synthetic‑PII corpus on text‑embedding‑3‑large, four locally hosted open embedding models, a cross‑corpus replication on BEIR NFCorpus and a Quora subset (over 26,000 chunks combined), seven vector‑store configurations, an adaptive‑attacker variant of the detector evaluation, and a paraphrased‑query retrieval benchmark. Distribution‑shifting perturbations are often caught by simple anomaly detectors; small‑angle orthogonal rotation defeats distribution‑based detection across every (model, corpus) pair tested. A disjoint‑Givens rotation encoder gives a closed‑form per‑vector capacity ceiling of floor(d/2) b bits, but real embedding manifolds impose a capacity‑detectability trade‑off, and the retrieval‑preserving operating point sits well below it. We propose VectorPin, a cryptographic provenance protocol that pins each embedding to its source content and producing model via an Ed25519 signature over a canonical byte representation. Any post‑embedding modification breaks signature verification. Embedding‑level integrity is a deployable, standardizable control that closes this attack class.

Authors:Valentin Six, Frederik Panse, Mathis Fajeau, Lancelot Da Costa, Mridul Sharma, Alfonso Amayuelas, Tim Z. Xiao, David Hyland, Philipp Hennig, Bernhard Schölkopf
Title: Learning POMDP World Models from Observations with Language-Model Priors
Abstract:
Whether navigating a building, operating a robot, or playing a game, an agent that acts effectively in an environment must first learn an internal model of how that environment works. Partially‑observable Markov decision processes (POMDPs) provide a flexible modeling class for such internal world models, but learning them from observation‑action trajectories alone is challenging and typically requires extensive environment interaction. We ask whether language‑model priors can reduce costly interaction by leveraging prior knowledge, and introduce \emphPinductor (POMDP‑inductor): an LLM proposes candidate POMDP models from a few observation‑action trajectories and iteratively refines them to optimize a belief‑based likelihood score. Despite using strictly less information, \emphPinductor matches the performance and sample efficiency of LLM‑based POMDP learning methods that assume privileged access to the hidden state, while significantly surpassing the sample efficiency of tabular POMDP baselines. Further results show that performance scales with LLM capability and degrades gracefully as semantic information about the environment is withheld. Together, these results position language‑model priors as a practical tool for sample‑efficient world‑model learning under partial observability, and a step toward generalist agents in real‑world environments. Code is available at https://github.com/atomresearch/pinductor.

Authors:Haaris Mehmood, Giorgos Tatsis, Dimitrios Alexopoulos, Karthikeyan Saravanan, Jie Xu, Anastasios Drosou, Mete Ozay
Title: DisAgg: Distributed Aggregators for Efficient Secure Aggregation in Federated Learning
Abstract:
Federated learning enables collaborative model training across distributed clients, yet vanilla FL exposes client updates to the central server. Secure‑aggregation schemes protect privacy against an honest‑but‑curious server, but existing approaches often suffer from many communication rounds, heavy public‑key operations, or difficulty handling client dropouts. Recent methods like One‑Shot Private Aggregation (OPA) cut rounds to a single server interaction per FL iteration, yet they impose substantial cryptographic and computational overhead on both server and clients. We propose a new protocol called DisAgg that leverages a small committee of clients called Aggregators to perform the aggregation itself: each client secret‑shares its update vector to Aggregators, which locally compute partial sums and return only aggregated shares for server‑side reconstruction. This design eliminates local masking and expensive homomorphic encryption, reducing endpoint computation while preserving privacy against a curious server and a limited fraction of colluding clients. By leveraging optimal trade‑offs between communication and computation costs, DisAgg processes 100k‑dimensional update vectors from 100k 5G clients with a 4.6x speedup compared to OPA, the previous best protocol.

Authors:Cenwei Zhang, Suncheng Xiang, Lei You
Title: MedCore: Boundary-Preserving Medical Core Pruning for MedSAM
Abstract:
Medical segmentation foundation models such as SAM and MedSAM provide strong prompt‑driven segmentation, but their image encoders are still too large for many clinical settings. Compression is also risky in medicine because a model can keep high Dice while losing boundary fidelity. We propose MedCore, a structured pruning framework for MedSAM. The main idea is to preserve two kinds of structures: structures that became important during SAM‑to‑MedSAM adaptation, and structures that have high boundary leverage. We identify the first type by a dual‑intervention score that compares zeroing a group with resetting it to its original SAM weight. We identify the second type by boundary‑aware Fisher estimation. We also introduce a boundary leverage principle, which shows that compression‑induced boundary displacement is controlled by logit perturbation on the boundary divided by the logit spatial gradient. This principle explains why boundary metrics can degrade even when Dice remains high. On polyp segmentation benchmarks, MedCore reduces parameters by 60.0% and FLOPs by 58.4% while achieving Dice 0.9549, Boundary F1 0.6388, and HD95 5.14 after recovery fine‑tuning. It also reaches 86.6% parameter reduction and 90.4G FLOPs with strong boundary quality. Our analysis further shows that MedSAM lies in a head‑fragile boundary regime: head‑pruning steps have 2.887 times larger 95th‑percentile boundary leverage than MLP‑pruning steps, and this logit‑level effect is consistent with BF1 and HD95 degradation. Our code is available at https://github.com/cenweizhang/MedCore.

Authors:Yatin Dandi, Matteo Vilucchio, Luca Arnaboldi, Hugo Tabanelli, Florent Krzakala
Title: Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning
Abstract:
Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low‑Degree Filtering (Neural LoFi), a stylized limit of gradient‑based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low‑degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel‑space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi‑layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how emergence of concepts arises with given sample complexity,and gives a concrete mechanism by which depth progressively constructs new features from old ones through low‑degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random‑feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient‑descent feature discovery with real datasets.

Authors:Gregory Beurier, Robin Reiter, Camille Noûs, Lauriane Rouan, Denis Cornet
Title: Reframing preprocessing selection as model-internal calibration in near-infrared spectroscopy: A large-scale benchmark of operator-adaptive PLS and Ridge models
Abstract:
Near‑infrared spectroscopy (NIRS) is rapid and non‑destructive, but reliable calibration still depends heavily on spectral preprocessing. In routine practice, preprocessing is often selected by large external pipeline searches that are costly, unstable on small calibration sets, and difficult to audit. We introduce operator‑adaptive calibration, a framework that moves linear preprocessing selection inside the calibration model. Candidate treatments are encoded as linear spectral operators, while nonlinear or sample‑adaptive corrections such as SNV, MSC, and ASLS are handled as fold‑local branches to prevent leakage. We instantiate the framework for PLS and Ridge regression. For PLS, covariance identities enable fast NIPALS and SIMPLS variants while preserving original‑wavelength coefficients. For Ridge, operator‑adaptive kernels yield a dual formulation with recoverable original‑space coefficients. The approach was evaluated on more than 50 heterogeneous NIRS datasets against conventional PLS, Ridge, CatBoost, and CNN baselines under documented search budgets. Compact operator‑adaptive PLS with ASLS branch preprocessing achieved a median RMSEP/PLS ratio of 0.960 with 42 wins on 57 datasets, while a deployable AOM‑Ridge selector improved over tuned Ridge by a median 2.22% with 35 wins on 52 datasets. The proposed models reduce dependence on large preprocessing‑HPO campaigns, produce traceable operator choices, retain interpretable coefficients, and fit in seconds for compact AOM‑PLS. Operator‑adaptive calibration therefore offers a practical route to faster, more robust, and more auditable NIRS method development.

Authors:Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan
Title: RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
Abstract:
Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight‑annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician‑motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30‑min windows and release two datasets: RealICU‑Gold with 930‑window annotations from 94 MIMIC‑IV patients, and RealICU‑Scale with 11,862 windows extended by Oracle, a physician‑validated LLM hindsight labeler. Existing LLMs including memory‑augmented ones performed poorly on RealICU, exposing two failure modes: a recall‑safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU‑Evo to study structured‑memory agents that improves long‑horizon reasoning but does not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision‑support in high‑stakes care. Project page: https://chengzhi‑leo.github.io/RealICU‑Bench/

Authors:Jaeyung Kim, YoungJoon Yoo
Title: ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin
Abstract:
Vector Quantized Variational Autoencoder (VQ‑VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ‑VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ‑VAE (ArcVQ‑VAE), a novel vector quantization framework that introduces a spherical angular‑margin prior (SAMP) for the codebook of a conventional VQ‑VAE. The proposed SAMP consists of Ball‑Bounded Norm Regularization, which constrains all codebook vectors within a time‑dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent‑space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ‑VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ‑VAE

Authors:Marc Molina Van den Bosch, Riccardo Taiello, Albert Sund Aillet, Andrea Protani, Miguel Angel Gonzalez Ballester, Luigi Serio
Title: DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning
Abstract:
Differentially private optimization suffers from a fundamental geometric mismatch: deep networks have highly anisotropic loss landscapes, yet DP‑SGD injects isotropic noise. Second‑order preconditioning can resolve this, but estimating curvature typically requires private data (consuming privacy budget) or public data (introducing distribution shift). We show that the Fisher Information Matrix decouples into architectural sensitivity, recoverable via synthetic noise, and input correlations, approximable from modality‑specific frequency statistics. We propose DP‑KFC, which constructs KFAC preconditioners by probing networks with structured synthetic noise, requiring neither private nor public data. Empirically, DP‑KFC consistently outperforms DP‑SGD and adaptive baselines across diverse modalities in strong privacy regimes (\varepsilon \leq 3). DP‑KFC matches private‑data preconditioners while public‑data variants degrade by up to 4.8%, showing that curvature can be estimated without consuming privacy budget or introducing distribution shift. This enables privacy‑preserving learning in specialized domains (e.g., medical applications) where regulatory constraints make data scarce.

Authors:Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen
Title: Continual Learning with Multilingual Foundation Model
Abstract:
This paper presents a multi‑stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non‑reclamatory usage of LGBTQ+‑related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges like data scarcity, class imbalance, and cross‑linguistic variation in sentiment expression. It integrates data‑driven model selection via cross‑validation, semantic‑preserving augmentation through back‑translation, inductive transfer learning with dynamic epoch‑level undersampling, and domain‑specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM‑RoBERTa selected as the foundation model based on macro‑averaged F1 score. Data augmentation via GPT‑4o‑mini back‑translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for the evaluation purposes where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 with masked language modeling pre‑training, RUN 3 and RUN 4 are previous predictions refined via language‑specific decision thresholds optimized via ROC analysis. Language‑specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold‑based optimization yields 2‑5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg‑research/MultiPRIDE‑Evalita‑2026.

Authors:Namhyoung Kim, Jae Wook Song
Title: Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction
Abstract:
Predicting cross‑sectional stock returns is challenging due to low signal‑to‑noise ratios and evolving market regimes. Classical factor models offer interpretability but limited flexibility, while deep learning models achieve strong performance yet often underutilize financial priors. We address this gap with PRISM‑VQ (PRior‑Informed Stock Model with Vector Quantization), a dynamic factor framework that integrates expert prior factors, vector‑quantized discrete latent factors learned from cross‑sectional structure, and a structure‑conditioned Mixture‑of‑Experts to generate time‑varying factor loadings. Vector quantization acts as an information bottleneck that suppresses noise while capturing robust market structure, with discrete codes serving both as latent factors and as routing signals for temporal expert specialization. Experiments on CSI 300 and S&P 500 show consistent improvements in cross‑sectional return prediction and portfolio performance over strong baselines while preserving interpretability. Our code is available at https://github.com/finxlab/PRISM‑VQ.

Authors:Lilin Zhang, Yimo Guo, Yue Li, Jiancheng Shi, Xianggen Liu
Title: Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation
Abstract:
Deep neural networks are highly vulnerable to adversarial examples, i.e.,small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real‑world long‑tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long‑tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose RobustLT, a plug‑and‑play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that RobustLT consistently enhances adversarial robustness and class‑balance on long‑tailed datasets. The code is available at \hrefhttps://github.com/zhang‑lilin/RobustLThttps://github.com/zhang‑lilin/RobustLT.

Authors:Daniel Matsui Smola
Title: Support-Conditioned Flow Matching Is Kernel Smoothing
Abstract:
Generative models are often conditioned on a small set of examples via cross‑attention. Under the Gaussian optimal‑transport path, we show that the exact velocity field induced by a finite support set is a Nadaraya‑‑Watson kernel smoother whose bandwidth decreases with flow time, from broad averaging at early steps to nearest‑neighbor at late steps. A single Gaussian‑kernel attention head exactly computes this field, connecting cross‑attention conditioning to classical kernel theory. The theory predicts three failure regimes: nearest‑neighbor collapse of the kernel at high dimension, mismatch between the isotropic kernel and the data geometry, and insufficient support for nonparametric estimation. Experiments on Gaussian mixtures, spherical shells, and DINOv2 ImageNet features confirm that learned conditioning improves in precisely these regimes, and that IP‑Adapter's cross‑attention implements approximate NW smoothing in practice.

Authors:Chaehee Song, Minseok Seo, Yeeun Seong, Doyi Kim, Changick Kim
Title: Query-Conditioned Test-Time Self-Training for Large Language Models
Abstract:
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test‑time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test‑time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self‑supervised objectives that lack query‑specific alignment. In this work, we propose Query‑Conditioned Test‑Time Self‑Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem‑‑solution pairs. Based on this, QueST generates such query‑conditioned pairs and uses them as supervision for parameter‑efficient fine‑tuning at test time. The adapted model is then used to produce the final answer, enabling query‑specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA‑Diamond scientific reasoning benchmark, QueST consistently outperforms strong test‑time optimization baselines. These results demonstrate that query‑conditioned self‑training is an effective and practical paradigm for test‑time adaptation in LLMs. Code is available at https://chssong.github.io/Query‑Conditioned‑TTST/.

Authors:Junhyuk Jeon, Seokhyeon Hong, Junyong Noh
Title: Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation
Abstract:
Text‑driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine‑level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text‑driven diffusion model. Existing stylization methods, however, either require style‑specific fine‑tuning of existing models or rely on heavy ControlNet‑based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork‑generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low‑rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization‑based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state‑of‑the‑art stylization results, while achieving improved stylization for unseen styles.

Authors:Amjad Seyedi, Lifang He, Songlin Zhao, Akwum Onwunta, Nicolas Gillis
Title: Supervised Deep Multimodal Matrix Factorization for Interpretable Brain Network Analysis
Abstract:
We present Supervised Deep Multimodal Matrix Factorization (SD3MF), an interpretable framework for integrative brain network analysis that generalizes Symmetric Nonnegative Matrix Tri‑Factorization (SNMTF) from unsupervised single‑graph clustering to supervised prediction over populations of multimodal graphs. SD3MF learns deep hierarchical factorizations for each modality together with a shared latent representation that aligns subjects across views. An encoder‑decoder formulation jointly optimizes graph reconstruction and supervised prediction, while adaptive weights enable data‑driven multimodal fusion. By representing each subject through community‑level interaction matrices, the model yields interpretable and discriminative features. Experiments on multimodal connectome datasets show that SD3MF consistently outperforms strong deep learning baselines such as CNNs and GNNs, while enabling biologically interpretable insights. Code for reproducibility is available at: https://github.com/amjadseyedi/SD3MF.

Authors:Stefan Stojanovic, Alexandre Proutiere
Title: Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning
Abstract:
Hierarchical reinforcement learning can improve generalization by decomposing long‑horizon decision‑making into simpler subproblems. However, existing approaches often rely on restrictive design choices, such as fixed temporal abstractions or goal‑conditioned objectives, which largely confine them to goal‑reaching tasks and limit their applicability to general reward functions. In this paper, we introduce switching successor measures, an extension of successor measures that enables hierarchical control in zero‑shot reinforcement learning without additional supervision, fixed horizons, or manually designed subgoals. We show that switching successor measures arise naturally from classical successor measures while preserving their underlying structure. Building on this result, we propose FB π‑Switch, an algorithm that extracts both a high‑level subgoal‑selection policy and a low‑level control policy directly from forward‑backward (FB) representations, allowing hierarchical behavior to emerge from a single learned representation. Experiments on both goal‑conditioned and general reward‑based tasks show that FB π‑Switch improves over non‑hierarchical baselines and matches state‑of‑the‑art hierarchical methods in goal‑conditioned settings. These results demonstrate that structured successor representations provide a flexible foundation for hierarchical zero‑shot reinforcement learning beyond goal‑reaching tasks. Our project website is available at: https://stestokth.github.io/switching‑successors/.

Authors:Mahsa Gazeran, Sayvan Soleymanbaigi, Fatemeh Daneshfar, Amjad Seyedi, Fardin Akhlaghian Tab
Title: ECG-NAT: A Self-supervised Neighborhood Attention Transformer for Multi-lead Electrocardiogram Classification
Abstract:
Electrocardiogram (ECG) arrhythmia classification remains challenging due to signal variability, noise, limited labeled data, and the difficulty in achieving both accuracy and efficiency in models. While self‑supervised learning reduces label dependency, most methods target either global contextual features or local morphological patterns, but rarely implement hierarchical multi‑scale feature extraction. ECG signals require architectures that simultaneously capture fine‑grained beat‑level morphology and broader rhythm‑level dependencies with computational efficiency. To overcome this limitation, this paper proposes the Electrocardiogram Neighborhood Attention Transformer (ECG‑NAT), a novel self‑supervised learning approach tailored for multi‑lead ECG classification. Our two‑stage approach begins with generative pretraining, using a masked autoencoder to reconstruct partially masked ECG signals across multiple diverse datasets, enabling the model to learn robust, domain‑invariant representations from unlabeled data. This is followed by discriminative fine‑tuning with a dual‑loss function that combines supervised contrastive and cross‑entropy losses, aligning representation learning with label prediction. The hierarchical attention mechanism efficiently captures multi‑scale temporal features from localized beat morphology to broader rhythm patterns at low computational cost. ECG‑NAT achieves robust performance on benchmark datasets, with 88.1% accuracy using only 1% labeled data, demonstrating strong efficacy in low‑resource settings. The framework combines superior classification performance with computational efficiency, making it practical for real‑time ECG diagnosis. The code will be made available upon acceptance at: https://github.com/Mahsagazeran/ECG‑NAT.

Authors:Jiahao Chen, Zihui Zhang, Yafei Yang, Jinxi Li, Shenxing Wei, Zhixuan Sun, Bo Yang
Title: EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
Abstract:
We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real‑world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real‑world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state‑of‑the‑art results.

Authors:David Iagaru, Nina M. Gottschling, Anders C. Hansen, Josselin Garnier
Title: On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods
Abstract:
Artificial intelligence (AI) has transformed imaging inverse problems, from medical diagnostics to Earth observation. Yet deep neural networks can produce hallucinations, realistic‑looking but incorrect details, undermining their reliability, especially when ground truth data is unavailable. We develop a theoretical framework showing that such hallucinations are not merely artifacts of particular models, but can arise from the ill‑posed nature of the inverse problem itself. We derive necessary and sufficient conditions for hallucinations, together with computable bounds on their magnitude that depend only on the forward model. Building on this theory, we introduce algorithms to: (1) estimate the minimum hallucination magnitude achievable by any reconstruction model for a given input; (2) assess the faithfulness of reconstructed details by a given reconstruction model. Experiments across three imaging tasks demonstrate that our approach applies broadly, including to modern generative models, and provides a principled way to quantify and evaluate AI hallucinations.

Authors:Haoning Wang, Wenchao Yang, Shuai Shen, Yang Li
Title: KAST-BAR: Knowledge-Anchored Semantically-Dynamic Topology Brain Autoregressive Modeling for Universal Neural Interpretation
Abstract:
While EEG foundation models have shown significant potential in universal neural decoding across tasks, their advancement remains constrained by the inadequacy modeling of complex spatiotemporal topology, as well as the inherent modality gap between low‑level physiological signals and high‑level textual semantics. To address these challenges, we propose a Knowledge‑Anchored Semantically‑Dynamic Topology Brain Autoregressive Model (KAST‑BAR), which dynamically aligns physiological representations derived from multi‑level brain topology with an expert‑level semantic space. Specifically, we design a Dual‑Stream Hierarchical Attention (DSHA) encoder that accurately captures the brain's intrinsic non‑Euclidean topology by modeling local temporal dynamics with global spatial contexts. On this basis, a Knowledge‑Anchored Semantic Profiler (KASP) is proposed to synthesize physically‑grounded and instance‑level textual profiles, which subsequently drive a Semantic Text‑Aware Refiner (STAR) to dynamically reconstruct EEG representations using Latent Expert Queries. By conducting large‑scale pre‑training on 21 diverse datasets to build a foundation model, KAST‑BAR effectively integrates expert‑level medical knowledge into EEG signal representations, consistently achieving superior performance across six downstream tasks. Our code is available at https://github.com/KAST‑BAR/KAST‑BAR

Authors:Aaditya L. Kachhadiya
Title: Local Inverse Geometry Can Be Amortized
Abstract:
Nonlinear inverse problems often trade inexpensive but fragile first‑order updates against curvature‑aware methods such as Gauss‑Newton and Levenberg‑Marquardt, which obtain stronger directions by repeatedly solving Jacobian‑based linearized systems. We propose a learned alternative: amortize local inverse geometry into a reusable reverse operator. Our framework learns a bidirectional surrogate, Deceptron, and deploys it through D‑IPG (Deceptron Inverse‑Preconditioned Gradient), an iterative solver that pulls residual‑corrected measurement‑space proposals back to latent space. The key mechanism is a Jacobian Composition Penalty (JCP), which trains the reverse Jacobian to act as a local left inverse of the forward Jacobian; its runtime counterpart, RJCP, measures the same inverse‑consistency error along optimization trajectories. We prove that D‑IPG is first‑order equivalent to damped Gauss‑Newton under local pseudoinverse consistency, with deviation controlled by composition error and conditioning. Across seven PDE inverse‑problem benchmarks, D‑IPG outperforms standard baselines, achieves 94.8% mean success across the six‑problem reliability suite, and reaches comparable or better recovery quality at up to 77x lower inference‑time solve cost on the main benchmarks.

Authors:Xu Bai, Bin Lu, Kun Zhang, Shengbo Chen, Xinbing Wang, Chenghu Zhou, Meng Jin
Title: Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle
Abstract:
Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most existing methods rely on pair‑wise similarity matching, where each node independently searches for its best partner based on global information. This selfishness matching paradigm incurs substantial computational and memory overhead. To address this problem, we shift to a non‑selfishness principle that prioritizes the collective interference of neighborhood in coarsening, and propose an efficient method named NOPE, which achieves linear memory consumption and near‑linear computational complexity in the number of nodes. Furthermore, we derive a faster variant NOPE, which reduces O(δ\dot d) interference evaluation to O(d) based on the local isotropy assumption, and consequently alleviates the computational bottleneck for high‑degree nodes. Experimental results show that NOPE achieves 1.8‑10× speedup over NOPE and surpass almost all baselines with 1‑3 orders of magnitude acceleration. Meanwhile, learning on coarsened graphs yields comparable performance to original graphs, and can even show superior performance over LLM‑based graph reasoning owing to compact graph information. The code can be available at https://github.com/dazonglian/NOPE‑main.

Authors:Guiquan Sun, Xikun Zhang, Jingchao Ni, Dongjin Song
Title: DRIFT: A Benchmark for Task-Free Continual Graph Learning with Continuous Distribution Shifts
Abstract:
Continual graph learning (CGL) aims to learn from dynamically evolving graphs while mitigating catastrophic forgetting. Existing CGL approaches typically adopt a task‑based formulation, where the data stream is partitioned into a sequence of discrete tasks with pre‑defined boundaries. However, such assumptions rarely hold in real‑world environments, where data distributions evolve continuously and task identity is often unavailable. To better reflect realistic non‑stationary environments, we revisit continual graph learning from a task‑free perspective. We propose a unified formulation that models the data stream as a time‑varying mixture of latent task distributions, enabling continuous modeling of distribution drift. Based on this formulation, we construct \emphDRIFT, a benchmark that spans a spectrum of transition dynamics ranging from hard task switches to smooth distributional drift through a Gaussian parameterization. We evaluate representative continual learning methods under this task‑free setting and observe substantial performance degradation compared to traditional task‑based protocols. Our findings indicate that many existing approaches implicitly rely on task boundary information and struggle under realistic task‑free graph streams. This work highlights the importance of studying continual graph learning under realistic non‑stationary conditions and provides a benchmark for future research in this direction. Our code is available at https://github.com/UConn‑DSIS/DRIFT.

Authors:John R. Minnick, Jinghui Geng, Kamran Hussain, Jesus Gonzalez-Ferrer, Ash Robbins, Mohammed A. Mostajo-Radji, David Haussler, Jason K. Eshraghian, Mircea Teodorescu
Title: SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
Abstract:
Neural population models, which predict the joint firing of many simultaneously recorded neurons forward in time, are typically evaluated by a single aggregate Pearson correlation r between predicted and actual spike counts, a number that masks critical structure. We argue that how we evaluate spike forecasting matters as much as what we build, and introduce SpikeProphecy, the first large‑scale benchmark for causal, autoregressive spike‑count forecasting on real electrophysiology recordings. Our core contribution is a population metric decomposition that separates aggregate performance into temporal fidelity, spatial pattern accuracy, and magnitude‑invariant alignment. The decomposition surfaces aspects of the underlying data that an aggregate scalar collapses together. We apply the protocol to 105 Neuropixels sessions (Steinmetz 2019 + IBL Repeated Site; ~89,800 neurons) with seven architecture baselines spanning four structural families: four SSMs (three diagonal and one non‑diagonal), a Transformer, an LSTM, and a spiking network. The decomposition surfaces a brain‑region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing‑statistics constraints (region ΔR^2 = 0.018 above the firing‑statistics covariates). It also exposes a sub‑Poisson evaluation floor where rigorous metrics combine with genuine biophysical constraints on regular spike trains, and yields a negative result on KL‑on‑output‑rates distillation for ANN‑to‑SNN transfer in this Poisson count domain.

Authors:Feijiang Li, Zhenxiong Li, Jieting Wang, Zizheng Jiu, Saixiong Liu, Liang Du
Title: Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering
Abstract:
Image clustering aims to partition unlabeled image datasets into distinct groups. A core aspect of this task is constructing and leveraging prior knowledge to guide the clustering process. Recent approaches introduce semantic descriptions as prior information, most of which typically relying on matching‑based techniques with predefined vocabularies. However, the limited matching space restricts their adaptability to downstream clustering tasks. Moreover, these methods primarily focus on reducing bias to improve performance, frequently overlooking the importance of variance reduction. To address these limitations, we propose GSEC (Image Clustering based on Generative Semantic Guidance and Bi‑Layer Ensemble), a framework designed to reduce bias through generative semantic guidance and mitigate variance via ensemble learning. Our method employs Multimodal Large Language Models to generate semantic descriptions and derive image embeddings via weighted averaging. Additionally, a bi‑layer ensemble strategy integrates cross‑modal information through BatchEnsemble in the inner layer and aligns outputs via an alignment mechanism in the outer layer. Comparative experiments demonstrate that GSEC outperforms 18 state‑of‑the‑art methods across six benchmark datasets, while further analysis confirms its effectiveness in simultaneously reducing both bias and variance. The code is available at https://github.com/2017LI/GSEC.git.

Authors:Haodong Wu, Jiahao Zhang, Lijie Hu, Yongqi Zhang
Title: From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning
Abstract:
Supervised fine‑tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top‑k subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed‑pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high‑quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two‑layer solver that decouples fixed‑pool materialization based on cached task‑, data‑, and model‑side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian‑process‑assisted ranking, and stagnation‑triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in‑distribution reasoning average across three base models, outperforming full‑data training, random recipe search, random top‑k, and single‑operator selectors. Additional Out‑of‑distribution graph‑reasoning results, search‑stability analyses, structural ablations, and 1.5B‑to‑7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at https://github.com/w253/AutoSelection.

Authors:Zheng Wang, Yuang Liu, Yangkai Ding
Title: Reinforced Collaboration in Multi-Agent Flow Networks
Abstract:
Multi‑agent systems provide a powerful way to extend large language models (LLMs) by decomposing a complex task into specialized subtasks handled by different agents. However, their performance is often hindered by error propagation, arising from suboptimal workflow design or inaccurate agent outputs, which can propagate through the agent collaboration process and degrade final results. To address the challenges, we present MANGO (Multi‑Agent Network Gradient Optimization), a data‑driven framework that organizes and refines agent collaboration via a flow network constructed from past successful workflows. MANGO integrates reinforcement learning and textual gradients to jointly optimize workflow paths and agent behaviors, while a skipping mechanism prevents redundant updates to well‑optimized agents for improving efficiency. Extensive experiments on seven benchmarks show that MANGO achieves up to 12.8% performance improvement over state‑of‑the‑art baselines, enhances efficiency by 47.4%, and generalizes effectively to unseen domains. Our code and datasets are publicly available at https://github.com/openJiuwen‑ai/agent‑store/tree/main/community/mango.

Authors:Rohith Reddy Bellibatlu
Title: RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems
Abstract:
Aggregate accuracy metrics dominate the evaluation of clinical AI decision‑support systems but do not detect deployment‑phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five‑dimension pre‑deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub‑criteria, pre‑specified pass/fail thresholds, and bias‑corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm‑Bonferroni family‑wise error correction. A central demonstration is that a classifier satisfying conventional high‑discrimination benchmarks can simultaneously fail input‑encoding stability and threshold‑shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. We validate this differential pass/fail pattern on a synthetic cohort and three publicly available real‑world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey, where failing dimensions differ across cohorts, providing preliminary evidence of construct validity. The Equity dimension is reframed as a proxy‑dependence diagnostic rather than a stand‑alone gate: any need‑based fairness verdict computed against a utilization‑derived proxy carries a construct‑validity problem the framework surfaces explicitly, triggering a procurement requirement for an outcome‑independent need measure before the gate is binding. RISED is released as an open‑source Python package that supplies the quantitative verdicts existing clinical AI reporting standards require, providing a principled gateway between in‑silico model validation and silent‑trial clinical evaluation.

Authors:Blaise Delattre, Hengyu Wu, Paul Caillon, Wei Yang Bryan Lim, Yang Cao
Title: Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing
Abstract:
Randomized smoothing provides strong, model‑agnostic robustness certificates, but existing guarantees are limited to single modalities, treating continuous and discrete inputs in isolation. This limitation becomes critical in multimodal models, where decisions depend on cross‑modal semantics and adversaries can jointly perturb heterogeneous inputs, rendering unimodal certificates insufficient. We introduce a unified randomized smoothing framework for mixed discrete‑‑continuous inputs based on an analytically tractable Neyman‑‑Pearson formulation of the joint worst‑case problem. By analyzing the joint likelihood ordering induced by factorized discrete and continuous noise, our approach yields a closed‑form, one‑dimensional certificate that strictly generalizes both Gaussian (image‑only) and discrete (text‑only) randomized smoothing. We validate the framework on multimodal safety filtering, providing, to our knowledge, the first model‑agnostic Neyman‑‑Pearson certificate for joint discrete‑token and continuous‑image perturbations in interaction‑dependent text‑‑image safety filtering.

Authors:Zhongkai Yu, Yichen Lin, Chenyang Zhou, Yuwei Zhang, Kun Zhou, Junxia Cui, Haotian Ye, Zhengding Hu, Zaifeng Pan, Ruiyi Wang, Yujie Zhao, Hejia Zhang, Jingbo Shang, Jishen Zhao, Yufei Ding
Title: ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation
Abstract:
Existing API‑based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed‑source APIs incompatible with chip vendors' air‑gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self‑trained models address the deployment constraint but remain single‑turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self‑trained multi‑agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross‑comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference‑model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack‑based inference workflow to prevent error propagation across turns, and a two‑stage training pipeline that first trains each agent individually to saturate its code‑generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data‑generation framework that produces 64.4K high‑quality reference model training samples. ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self‑trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available in https://github.com/zhongkaiyu/ChipMATE.

Authors:Divya Sitani
Title: Multitask Multimodal Fusion with Tabular Foundation Models for Peak and Durability Prediction of Pertussis Booster Response
Abstract:
Pertussis booster vaccination produces immune responses that vary widely across individuals in both peak magnitude and long‑term durability. These two phases are governed by partly distinct biological compartments:peak reflects acute B‑cell activation and antibody secretion, while durability reflects the establishment of long‑term humoral memory. Yet most computational models target only one, missing the full boost‑and‑wane trajectory. Jointly predicting both is non‑trivial because the two endpoints are biologically dissociated rather than redundant; samples are small, modalities are heterogeneous with structured missingness, and the two tasks rely on different measurement windows. We propose a multi‑task contrastive multimodal fusion architecture combining frozen TabPFN‑v2 per‑modality encoders, a dual‑label supervised contrastive loss that treats two subjects as a positive pair if they agree on the Task 1 label or the Task 2 label, modality dropout calibrated to empirical missingness, and missingness‑masked attention fusion. Applied to a curated subset of the CMI‑PB pertussis booster dataset (n = 158 subjects, four modalities, 44.9% with at least one modality missing; Spearman r = ‑0.58 between peak and durability, n = 96), the model achieves test AUROC 0.797 (95% CI [0.621, 0.948]) for peak response and 0.755 (95% CI [0.519, 0.945]) for durability, with both significant under joint label permutation (N = 1000; p = 0.002 and p = 0.045). Across logistic regression, XGBoost, and MLP baselines on raw features and on TabPFN embeddings, the proposed model is the only one whose 95% CIs lie above chance on both tasks simultaneously. Per‑modality contribution analyses recover task‑specific modality contributions consistent with the underlying immunology: peak prediction is carried by cytokine signatures, while durability is carried by baseline antibody features.

Authors:Kaixiang Zhao, Bolin Shen, Yuyang Dai, Shayok Chakraborty, Yushun Dong
Title: GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network, and Can We Stop It?
Abstract:
Graph neural networks (GNNs) deployed as cloud services can be \emphstolen through \emphmodel‑extraction attacks, which train a surrogate from query responses to reproduce the target's behaviour, and a growing line of ownership defenses tries to prevent or trace such theft. The title of this paper asks two questions: \emphhow hard is it to steal a GNN?, and \emphcan we stop it? Prior work cannot answer either, because experiments use inconsistent datasets, threat models, and metrics. We introduce \emphGraphIP‑Bench, a unified benchmark which evaluates both sides under a single black‑box protocol. It integrates twelve extraction attacks, twelve defenses spanning watermarking, output‑perturbation, and query‑pattern‑detection families, ten public graphs covering homophilic, heterophilic, and large‑scale regimes, three GNN backbones, and three graph‑learning tasks, and it reports fidelity, task utility, ownership verification, and computational cost on shared splits, queries, and budgets. We further add a joint attack‑and‑defense track which runs every attack on every defended target and measures watermark verification on the resulting surrogate, which exposes the protection that a defense retains after extraction. The empirical picture is short: stealing a GNN is easy at medium query budgets and most defenses do not change this; several watermarks verify reliably on the protected model but lose most of their verification signal on the extracted surrogate, which exposes a gap that single‑model evaluations miss; and heterophilic graphs are systematically harder to steal, while a cross‑architecture mismatch between target and surrogate reduces but does not prevent extraction. Code: \hrefhttps://github.com/LabRAI/GraphIP‑BenchLabRAI/GraphIP‑Bench.

Authors:Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal
Title: REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
Abstract:
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt‑based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent‑space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent‑space attack framework. REALISTA constructs an input‑dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing‑based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state‑of‑the‑art realistic attacks on open‑source LLMs and, crucially, succeeds in attacking large reasoning models under free‑form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun‑Liang/REALISTA.

Authors:Alejandro Murillo-Gonzalez, Mahmoud Ali, Lantao Liu
Title: Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization
Abstract:
Multi‑objective reinforcement learning in robotic domains requires balancing complex, non‑convex trade‑offs between conflicting objectives. While linear scalarization methods provide stability, they are theoretically incapable of recovering solutions within non‑convex regions of the Pareto front. Conversely, static non‑linear scalarizations (e.g., Tchebycheff) can theoretically access these regions but often suffer from severe gradient variance and optimization instability in deep RL. In this work, we propose an Adaptive Smooth Tchebycheff framework that resolves this tension by dynamically modulating the curvature of the optimization landscape. We introduce a novel conflict‑driven controller that regulates the optimization smoothness based on real‑time gradient interference. This allows the agent to anneal toward precise, non‑convex scalarization when objectives align, while elastically reverting to stable, smooth approximations when destructive gradient conflicts emerge. We validate our approach on a challenging robotic stealth visual search task ‑‑ a proxy for monitoring of protected/fragile ecosystems ‑‑ where an agent must balance search, exposure/interference minimization and exploration speed. Extensive ablations confirm that our conflict‑aware adaptation enables the robust discovery of Pareto‑optimal policies in non‑convex regions inaccessible to linear baselines and unstable for static non‑linear methods. Website: https://alejandromllo.github.io/research/pasta/

Authors:Jack Young
Title: WriteSAE: Sparse Autoencoders for Recurrent State
Abstract:
We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state‑space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba‑2, and RWKV‑7 write to a d_k × d_v cache through rank‑1 updates k_t v_t^\top that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per‑token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched‑norm ablation on 92.4% of n=4,851 firings at Qwen3.5‑0.8B L9 H4, the 87‑atom population test holds at 89.8%, the closed form predicts measured effects at R^2=0.98, and Mamba‑2‑370M substitutes at 88.1% over 2,500 firings. Sustained three‑position installs at 3× lift midrank target‑in‑continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix‑recurrent write site.

Authors:Zhizhen Zhang, Hyemin Gu, Benjamin J. Zhang, Daniel Elenius, Michael Tyrrell, Theo J. Bourdais, Houman Owhadi, Markos A. Katsoulakis, Tuhin Sahai
Title: ISOMORPH: A Supply Chain Digital Twin for Simulation, Dataset Generation, and Forecasting Benchmarks
Abstract:
Open time‑series forecasting (TSF) benchmarks cover retail, energy, weather, and traffic, but supply‑chain logistics remains underserved. We introduce ISOMORPH, the first public digital twin of a multi‑echelon logistics network with fully interpretable, user‑configurable parameters and modular topology, demand process, and control rules. The simulator advances a directed routing graph in discrete time: demand arrives at the destination, is served from stock or recorded as backlog, and triggers replenishment through the network. The state vector tracks per‑node on‑hand inventory with outstanding orders, in‑transit shipments, and a smoothed demand estimate, so the dynamics close as a Markov chain on a tractable state space whose transition kernel acts linearly on the empirical distribution of the state. The released data reproduces the bullwhip effect at empirically consistent magnitudes, and three conservation laws encoded in the Markov chain serve as verification tools when users extend the simulator. We release datasets at two catalogue scales (C=50 and C=200) with six scenario sweeps producing 30 additional rollouts and 20 Latin‑hypercube perturbations, exhibiting dynamics absent from fixed TSF benchmarks: variance amplification, cascading bottlenecks, regime shifts, and cross‑channel coupling through shared macro shocks. Zero‑shot evaluation of four foundation models (Chronos, Moirai, TimesFM, Lag‑Llama) shows MASE values exceeding public GIFT‑Eval references at low‑to‑moderate horizons, supporting incorporation into existing benchmarks. The same pairing produces forecast confidence bands via Latin‑hypercube perturbation of demand‑side knobs, forward UQ from parameter uncertainty unavailable on standard TSF datasets, demonstrating that foundation models can serve as fast surrogates for the digital twin's forward UQ. Code (MIT): https://github.com/tuhinsahai/ISOMORPH.

Authors:Simone Antonelli, Vincent Davis, Harrison Rush, Anthony Potdevin, Jesse Shrader, Vikash Singh, Emanuele Rossi
Title: Predicting Channel Closures in the Lightning Network with Machine Learning
Abstract:
The Lightning Network (LN) is a second‑layer protocol for Bitcoin designed to enable fast and cost‑efficient off‑chain transactions. Channels in the LN can be closed either by mutual agreement or unilaterally through a forced closure, which locks the involved capital for an extended period and degrades network reliability. In this paper, we study the problem of predicting channel closure types from publicly available gossip data, framing it as a temporal link classification task over the evolving channel graph. We construct a dataset spanning over two years of LN activity and benchmark a range of machine learning approaches, from MLPs to temporal graph neural networks and spectral encodings. Our experiments reveal that the dominant predictive signals are temporal and behavioural, namely how recently each endpoint was active and the per‑node history of past closures, while the surrounding network topology provides no additional benefit. We find that a simple MLP operating on edge‑level features, node‑level event counts, and temporal patterns outperforms all graph‑based approaches, and discuss how the inherent privacy of the LN, where critical information such as channel balances and payment flows remains hidden, fundamentally limits the predictability of closures from gossip data alone. We publicly release the dataset and code at https://github.com/AmbossTech/ln‑channel‑closure‑prediction to encourage further research on this practically relevant task.

Authors:Paul Hoareau, Kuan Yi Wang, Brandon Bujak, Roy Sun, Govind Nair, Irene Cortese, Charidimos Tsagkas, Daniel Reich, Julien Cohen-Adad
Title: Optimization in Sparse 2D to Dense 3D Weakly Supervised Learning: Application to Multi-Label Segmentation of Large ex vivo MRI Data
Abstract:
INTRODUCTION | Fully supervised 3D segmentation of high‑resolution ex vivo MRI is limited by the prohibitive cost of volumetric annotation, forcing reliance on sparse 2D slices. Weakly supervised Sparse‑to‑Dense frameworks bridge this gap, but guidelines remain ambiguous regarding human‑centric visual enhancements and transferring optimization strategies across dimensions. We analyze divergent regularization needs for multi‑class segmentation of high‑resolution ex vivo spinal cord MRI. METHODS | We used 9.4T MRI of multiple sclerosis spinal cords (>104,000 slices) with sparse annotations (428 slices). A 2D Teacher trained on sparse slices generated dense pseudo‑labels to train a 3D Student. We systematically evaluated the impact of human‑centric preprocessing, spatial augmentation, and soft‑label regularization on both architectures. RESULTS | We identified a critical divergence in training dynamics. The 2D Teacher required strong spatial augmentation and soft‑labeling to overcome data scarcity, improving White Matter Lesion Dice scores by >11 points. However, propagating these techniques to the 3D Student degraded its performance. Furthermore, human‑centric preprocessing (e.g., CLAHE) disrupted global statistical cues, dropping Gray Matter Lesion Dice scores by ~25 points. DISCUSSION | Our study highlights a perception divergence (human‑centric contrast enhancement harms machine models) and a regularization conflict across dimensions. 3D architectures trained on dense pseudo‑labels exhibit fundamentally different optimization landscapes than 2D counterparts and require distinct, conservative regularization. Code and models: https://github.com/ivadomed/model_seg_sc‑gm‑lesion_human_ms_exvivo_t2star.

Authors:Mohammad Jahid Ibna Basher, Ali Khodabandeh Yalabadi, Ivan Garibay, Ozlem Ozmen Garibay
Title: ConRetroBert: EMA Stabilized Dual Encoders for Template-Based Single-Step Retrosynthesis
Abstract:
Template based single step retrosynthesis predicts reactants by selecting and applying an explicit reaction template, making each prediction traceable to a chemical transformation rule. This is useful for synthesis planning, but template based methods are often viewed as less competitive than template free models because template prediction is commonly formulated as global classification over a long tailed rule library. We argue that this weakness is not inherent to templates, but to the learning formulation. We present ConRetroBert, a dual encoder framework that reframes template based retrosynthesis as dense product template retrieval followed by candidate set listwise ranking. Stage 1 uses contrastive pretraining to learn a shared embedding space between products and reaction templates. Stage 2 refines template ranking over mined hard negative candidate sets with a multi positive listwise objective. To enable template side adaptation without destabilizing hard negative mining, ConRetroBert uses a slow moving exponential moving average template encoder for retrieval bank construction while updating the live template encoder through the ranking loss. On the local USPTO‑50k benchmark, Stage 2 candidate set ranking improves top‑1 reaction accuracy from 50.5% to 61.3%, while EMA stabilized template adaptation further improves it to 62.4%. Fine tuning from a leakage controlled USPTO‑Full checkpoint reaches 75.4% top‑1 accuracy on USPTO‑50k. We also show that retrieval based template prediction is strong in the long tail of rare templates, and that many correct reactant predictions arise from alternative explicit templates rather than only the recorded positive label. Code and data are available at https://github.com/JahidBasher/ConRetroBert.

Authors:Zhiming Yu, Wangtao Lu, Xin Lai
Title: FePySR: A Neural Feature Extraction Framework for Efficient and Scalable Symbolic Regression
Abstract:
A fundamental challenge in symbolic regression (SR) is efficiently recovering complex mathematical expressions from observational data. Although this problem is NP‑hard, many expressions of practical interest decompose naturally into combinations of nonlinear feature modules, concentrating structural complexity into a small number of reusable components. Here, we introduce FePySR, a two‑stage framework that reduces the SR search space by extracting valid features prior to equation search. FePySR first employs a heterogeneous neural network to constrain observational data to a set of candidate expressions, then performs structural optimization within this refined expression space using PySR. Across five standard benchmarks, FePySR outperforms state‑of‑the‑art methods by achieving higher equation recovery rates. On a set of 75 highly complex synthesized equations, FePySR recovers 36 equations, while producing substantially smaller mean squared errors on the remaining unrecovered cases, with reduced computation time compared to PySR. FePySR's first stage also maintains consistent performance under varying numbers of selected top features and increasing levels of noise in the observational data. Applied to ordinary differential equations governing biological systems, FePySR successfully identifies governing equations in 24 out of 100 tests where PySR recovers none. Taken together, FePySR is a generalizable framework that can enhance the SR solvers, enabling the efficient and reliable recovery of symbolic expressions across scientific domains.

Authors:Mushir Akhtar, M. Tanveer, Mohd. Arshad
Title: CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks
Abstract:
Randomized neural networks (RdNNs) enable efficient, backpropagation‑free training by freezing randomly initialized input‑to‑hidden weights, which permits a closed‑form solution for the output layer. However, conventional random initialization is blind to inter‑feature dependence, ignoring correlations, asymmetries, and tail dependence in the data, which degrades conditioning and predictive performance. To the best of our knowledge, this limitation remains unaddressed in the RdNN literature. To close this gap, we propose CAWI (Copula‑Aligned Weight Initialization), a framework that draws input‑to‑hidden weights from a data‑fitted copula that matches empirical dependence, ensuring the frozen projections respect inter‑feature dependence without sacrificing the closed‑form solution. CAWI (i) maps each feature to the unit interval using empirical CDFs, (ii) fits a multivariate copula that captures rank‑based dependence among features, and (iii) samples each weight column w_j from the fitted copula and applies a fixed inverse marginal transform to set scale. The objective, solver, and "freeze‑once" paradigm remain unchanged; only the sampling law for W becomes dependence‑aware. For dependence modeling, we consider two copula families: elliptical (Gaussian, t) and Archimedean (Clayton, Frank, Gumbel). This enables CAWI to handle diverse dependence, including tail dependence. We evaluate CAWI across 83 diverse classification benchmarks (binary and multiclass) and two biomedical datasets, BreaKHis and the Schizophrenia dataset, using standard shallow and deep RdNN architectures. CAWI consistently delivers significant improvements in predictive performance over conventional random initialization. Code is available at: https://github.com/mtanveer1/CAWI

Authors:Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri
Title: Learning, Fast and Slow: Towards LLMs That Adapt Continually
Abstract:
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task‑specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in‑context learning with fixed LLM parameters can cheaply and rapidly adapt to task‑specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in‑context or in‑weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast‑slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task‑specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast‑Slow Training (FST) is up to 3x more sample‑efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST‑trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL‑training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter‑only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter‑only RL stalls.

Authors:Vage Egiazarian, Erik Schultheis, Andrei Panferov, Earl Killian, Torsten Hoefler, Dan Alistarh
Title: Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
Abstract:
A major recent advance in quantization is given by microscaled 4‑bit formats such as NVFP4 and MXFP4, quantizing values into small groups sharing a scale, assuming a fixed floating‑point grid. In this paper, we study the following natural extension: assume that, for each group of values, we are free to select the "better" among two or more 4‑bit grids marked by one or more bits in the scale value. We formalize the power‑of‑two‑grids (PO2) problem, and provide theoretical results showing that practical small‑group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. On the practical side, we instantiate several grid families, including 1) PO2(NF4), which pairs the standard NF4 normal grid with a learned grid, 2) MPO2, a grid pair that is fully learned over real weights and activations, 3) PO2(Split87), an explicit‑zero asymmetric grid and 4) SFP4, a TensorCore‑implementable triple which pairs NVFP4 with two shifted variants. Results for post‑training quantization of standard open models and pre‑training of Llama‑like models show that adaptive grids consistently improve accuracy vs single‑grid FP4 under both weight‑only and weight+activation. Source code is available at https://github.com/IST‑DASLab/GridGames.

Authors:Moussa Kassem Sbeyti, Joshua Holstein, Philipp Spitzer, Nadja Klein, Gerhard Satzger
Title: From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review
Abstract:
High‑quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI‑assisted annotation has therefore become standard in large‑scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI‑assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose‑built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box‑level analysis confirms that the cues redirect annotator effort toward high‑uncertainty predictions and away from well‑localized boxes. These findings establish localization uncertainty as a lever to improve human‑in‑the‑loop annotation. Code is available at https://mos‑ks.github.io/MUHA/.

Authors:Junyu Xiong, Yuan Pu, Jia Tang, Yazhe Niu
Title: PriorZero: Bridging Language Priors and World Models for Decision Making
Abstract:
Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior‑dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long‑horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment‑specific dynamics; while end‑to‑end fine‑tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM‑derived conceptual priors into world‑model‑based planning through a decoupled rollout‑training design. During rollout, a novel root‑prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world‑model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions, its value estimates are then leveraged to provide fine‑grained credit assignment signals for stable LLM fine‑tuning via alternating optimization. Experiments across diverse benchmarks, including text‑based adventure games in Jericho and instruction‑following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM‑empowered decision‑making. Our code is available at https://github.com/opendilab/LightZero.

Authors:Runhe Lai, Xinhua Lu, Yanqi Wu, Jinlun Ye, Weijiang Yu, Ruixuan Wang
Title: Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
Abstract:
Multimodal large language models (MLLMs) have achieved remarkable progress, yet the object hallucination remains a critical challenge for reliable deployment. In this paper, we present an in‑depth analysis of instruction token embeddings and reveal that they implicitly encode visual information while effectively filtering erroneous information introduced by misleading visual embeddings. Building on this insight, we propose the Instruction Lens Score (InsLen), which combines a Calibrated Local Score with a Context Consistency Score that measures context consistency of the object tokens. The proposed approach serves as a plug‑and‑play object hallucination detector without relying on auxiliary models or additional training. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate that InsLen consistently outperforms existing hallucination detection methods, highlighting its effectiveness and robustness. The code is available at https://github.com/Fraserlairh/Instruction‑Lens‑Score.

Authors:Chengzhu Bao, Xianglong Yan, Zhiteng Li, Guangshuo Qin, Guanghua Yu, Yulun Zhang
Title: SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization
Abstract:
NVFP4 has recently emerged as an efficient 4‑bit microscaling format for large language models (LLMs), offering superior numerical fidelity with native hardware support. However, existing methods often yield suboptimal performance due to inflexible scale selection and the coupled treatment of quantization and dequantization scales. To address these issues, we propose Scale Optimization for Accurate Reconstruction (SOAR), a novel post‑training quantization framework that improves the accuracy of NVFP4 quantization. At its core, SOAR features Closed‑form Joint Scale Optimization (CJSO), which jointly optimizes global and block‑wise scales via analytical solutions derived from reconstruction error minimization. Furthermore, it incorporates Decoupled Scale Search (DSS). DSS decouples the high‑precision quantization scale from its constrained dequantization counterpart, and performs discrete search to mitigate precision loss from scale quantization. Extensive experiments across multiple LLMs show that our method consistently outperforms existing NVFP4 quantization baselines, achieving superior accuracy under the same memory footprint with no additional hardware overhead. The code and models will be available at https://github.com/steven‑bao1/SOAR.

Authors:Matthew M. Hong, Jesse Zhang, Anusha Nagabandi, Abhishek Gupta
Title: TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
Abstract:
Fine‑tuning pre‑trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre‑training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre‑training and RL fine‑tuning. Our pre‑training method, Context‑Smoothed Pre‑training (CSP), injects forward‑diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine‑tune pre‑trained policies via Timestep‑Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine‑tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image‑based VLA policies, we show that TMRL improves RL fine‑tuning sample efficiency. Notably, TMRL enables successful real‑world fine‑tuning on complex manipulation tasks in under one hour. Videos and code available at https://weirdlabuw.github.io/tmrl/.

Authors:Joshua Opria
Title: STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
Abstract:
We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio‑to‑chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi‑stage hybrid: a two‑stage CRNN onset detector and a six‑model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word‑aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30‑song in‑envelope benchmark constructed by screening candidate songs on a single audio‑quality criterion ‑‑ the median 1‑second drum‑stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/‑ 100 ms tolerance with per‑song global offset search. We report a complete ablation of seven drum‑pipeline components with paired per‑song Wilcoxon tests, an analysis of ground‑truth‑to‑audio timing distributions in community Clone Hero charts, and a per‑class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.

Authors:Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Xiong Jun Wu, Likang Wu, Hongke Zhao
Title: Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Abstract:
Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO‑style off‑policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a \emphtraining‑‑inference discrepancy term that aligns inference‑side and training‑side distributions at the same behavior‑policy version, and a \emphpolicy‑staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training‑side logits, or old logits. This missing‑old‑logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old‑logit acquisition strategies: snapshot‑based version tracking, a dedicated old‑logit model, and synchronization via partial rollout interruption, and compare their system trade‑offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO‑EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.

Authors:Muhammad Aqeel, Maham Nazir, Uzair Khan, Marco Cristani, Francesco Setti
Title: Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection
Abstract:
Zero‑shot anomaly detection aims to identify defects in unseen categories without target‑specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA‑DINO, an anomaly‑aware vision‑language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text‑guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context‑specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state‑of‑the‑art performance, achieving 93.5% image‑AUROC on MVTec‑AD and strong cross‑domain generalization to medical imaging without domain‑specific fine‑tuning. https://github.com/aqeeelmirza/AVA‑DINO

Authors:Xu Chu, Guanyu Wang, Zhijie Tan, Xinrong Chen, Ziyu Li, Tong Mo, Weiping Li
Title: Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
Abstract:
Large Language Models (LLMs) suffer from order bias, where their performance is affected by the arrangement order of input elements. This unfairness limits the model's applications in scenarios such as in‑context learning and Retrieval‑Augmented Generation (RAG). Recent studies attempt to obtain optimal or suboptimal arrangements based on statistical results or using dataset‑based search, but these methods increase inference overhead while leaving the model's inherent order bias unresolved. Other studies mitigate order sensitivity through supervised fine‑tuning using augmented training sets with multiple order variants, but often at the cost of accuracy, trapping the model in consistent yet incorrect hallucinations. In this paper, we propose Dual Group Advantage Optimization (DGAO), which aims to improve model accuracy and order stability simultaneously. DGAO calculates and balances intra‑group relative accuracy advantage and inter‑group relative stability advantage, rewarding the policy model for generating order‑stable and correct outputs while penalizing order‑sensitive or incorrect responses. This marks the first time reinforcement learning has been used to mitigate LLMs' order sensitivity. We also propose two new metrics, Consistency Rate and Overconfidence Rate, to reveal the pseudo‑stability of previous methods and guide more comprehensive evaluation. Extensive experiments demonstrate that DGAO achieves superior order fairness while improving performance on RAG, mathematical reasoning, and classification tasks. Our code is available at: https://github.com/Hyalinesky/DGAO.

Authors:Xin Ma, Wei Chen, Qi Liu, Derong Xu, Zhi Zheng, Tong Xu, Enhong Chen
Title: More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing
Abstract:
Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter‑intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a "black box", leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self‑reinforcing stability loop and proves that, when combined with ridge‑regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm‑up stage and full whitening, improving long‑horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at https://github.com/MINE‑USTC/StableEdit.

Authors:Thor Klamt, Wolfgang Nejdl, Ming Tang
Title: Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling
Abstract:
Machine‑learning predictors of biochemical activity often exhibit large random‑split‑to‑leave‑one‑target‑out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation‑science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis‑targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random‑split cross‑validation, while the leave‑one‑target‑out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within‑target interpolation, whereas LOTO measures the novel‑target prediction that de‑novo design depends on. We decompose this gap and identify inter‑laboratory measurement variance as the dominant component, anchored by a within‑target cross‑laboratory cascade bounding the inter‑laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation‑threshold choice. Across eight published architectures and ESM‑2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES‑level deduplication; a 21‑dimensional 2000‑trial hyperparameter optimisation cannot break this ceiling, and the rank‑1 single‑seed configuration regresses by 0.161 AUROC under multi‑seed validation, matching a closed‑form selection‑bias prediction (Bailey and Lopez de Prado, 2014). Few‑shot k=5 stratified per‑target retraining combined with ADMET features lifts 65‑target LOTO AUROC from 0.668 to 0.7050, and post‑hoc Platt scaling recovers raw output to within the 0.05 well‑calibrated threshold. We release PROTAC‑Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance‑decomposition framework, the per‑target calibration protocol, and the evaluation code.

Authors:Yingjie Zhou, Yuqin Xie, Fanxing Liu, Dongjin Song, Ce Zhu, Lingqiao Liu
Title: Learning Feature Encoder with Synthetic Anomalies for Weakly Supervised Graph Anomaly Detection
Abstract:
Weakly supervised graph anomaly detection aims to unveil unusual graph instances, e.g., nodes, whose behaviors significantly differ from normal ones, given only a limited number of annotated anomalies and abundant unlabeled samples. A major challenge is to learn a meaningful latent feature representation that reduces intra‑class variance among normal data while remaining highly sensitive to anomalies. Although recent works have applied self‑supervised feature learning for graph anomaly detection, their strategies are not specifically tailored to its unique requirements, motivating our exploration of a more domain‑specific approach. In this paper, we introduce a weakly supervised graph anomaly detection method that leverages a feature learning strategy tailored for graph anomalies. Our approach is built upon a multi‑task learning scheme that extracts robust feature representations through synthesized anomalies. We generate synthetic anomalies by perturbing the normal graph in various ways and assign a dedicated detection head to each anomaly type, ensuring that learned features are sensitive to potential deviations from normal patterns. Although synthetic anomalies may not perfectly replicate real‑world patterns, they provide valuable auxiliary data for effective feature learnin, much like features learned from ImageNet classification transfer to downstream vision tasks. Additionally, we adopt a two‑phase learning strategy: an initial warm‑up phase using only synthetic samples, followed by a full‑training phase integrating both tasks, to balance the influence of synthetic and real data. Extensive experiments on public datasets demonstrate the superior performance of our method over its competitors. Code is available at https://github.com/yj‑zhou/SAWGAD.

Authors:SeongMin Jin, Doo Seok Jeong
Title: WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views
Abstract:
Learning latent representations that capture both semantic and spatial information is central to efficient spatio‑semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task‑specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity‑dependent encoder that maps a given observation into a spatio‑semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio‑semantic representation. Using facial landmark localization as a proof‑of‑concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real‑time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio‑semantic reasoning. This framework is open‑sourced at https://github.com/JinSeongmin/WorldComp2D.

Authors:Yan Jiang, Ruihong Qiu, Zi Huang
Title: Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Abstract:
Recently, reinforcement learning (RL) has been widely applied during post‑training for diffusion large language models (dLLMs) to enhance reasoning with block‑wise semi‑autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post‑training in multi‑domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi‑domain RL for dLLMs, which will largely affect the post‑training effectiveness for rollout‑based RL methods; (2) a novel dataset, Block‑R1‑41K is constructed with a best‑improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block‑R1, for flexible RL post‑training for dLLMs in both single and cross domain; and (4) a simple yet powerful cross‑domain post‑training method with sample‑level best‑improved training block sizes. Extensive experiments on 13 distinct datasets, 7 latest RL algorithms and diverse dLLM backbones are comprehensively covered in Block‑R1. The benchmark is open‑sourced at https://github.com/YanJiangJerry/Block‑R1 with the dataset released at https://huggingface.co/datasets/YanJiangJerry/Block‑R1‑41K.

Authors:Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye
Title: Debiased Model-based Representations for Sample-efficient Continuous Control
Abstract:
Model‑based representations recently stand out as a promising framework that embeds latent dynamics information into the representations for downstream off‑policy actor‑critic learning. It implicitly combines the advantages of both model‑free and model‑based approaches while avoiding the training costs associated with model‑based methods. Nevertheless, existing model‑based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor‑critic learning, leading to inferior performance. To address this, we propose Debiased model‑based Representations for Q‑learning, tagged DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state‑action pair and the next state besides minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.

Authors:Liqin Ye, Yanbin Yin, Michael Galarnyk, Yuzhao Heng, Sudheer Chava, Chao Zhang
Title: Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling
Abstract:
The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post‑training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evoutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual‑axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code on https://github.com/liqinye/EvoTD.

Authors:Madhurima Panja, Danny D'Agostino, Huitao Li, Tanujit Chakraborty, Nan Liu
Title: EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting
Abstract:
The increasing adoption of data‑driven decision‑making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high‑quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large‑scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state‑of‑the‑art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).

Authors:Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu
Title: Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
Abstract:
Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group‑based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high‑scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonus. However, these approaches do not change the winner‑takes‑all nature, where rollouts still compete for individual advantage rather than cooperating for maximizing global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team‑level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward‑weighted semantic embeddings, where only correct and non‑redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non‑redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at \hrefhttps://github.com/bradybuddiemarch/gcpothis.

Authors:Yupeng Su, Ruijie Zhang, Ziyue Liu, Yequan Zhao, Zheng Zhang
Title: MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization
Abstract:
The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon's optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low‑bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre‑quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power iteration, ensuring that quantization errors perturb only singular value magnitudes rather than rotating singular vector directions. Third, we adopt μ‑law companding quantization to allocate higher resolution to densely packed momentum values, shifting the quantization objective from outlier preservation to dense‑region distinguishability. Together, these techniques enable stable 4‑bit quantization of Muon's optimizer states. Pre‑training experiments on GPT‑style and LLaMA‑style models demonstrate that MuonQ at 4‑bit precision closely matches full‑precision Muon in both training loss and downstream task accuracy, while reducing optimizer state memory by up to 7.3 ×. Our code is available at https://github.com/YupengSu/MuonQ.

Authors:Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra
Title: Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation
Abstract:
Image‑to‑code generation tests whether a vision‑language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain‑specific reconstruction errors. We introduce Vision2Code, a reference‑code‑free benchmark and evaluation framework for multi‑domain image‑to‑code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset‑specific rubrics and deterministic guardrails for severe semantic failures. We report render‑success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding‑similarity baselines. Across nine open‑weight and proprietary models, we find that image‑to‑code performance is domain‑dependent: leading models perform well on regular chart‑ and graph‑like visuals but remain weak on spatial scenes, chemistry, documents, and circuit‑style diagrams. Finally, we show that evaluator‑filtered model outputs can serve as training data to improve image‑to‑code capability, with Qwen3.5‑9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image‑to‑code generation. Our code and data are publicly available at https://image2code.github.io/vision2code/.

Authors:Jung Min Kang
Title: The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
Abstract:
Benchmark evaluation across AI and safety‑critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co‑occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains ‑‑ NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity ‑‑ we show that Spearman rank correlation ρ between simple‑average rankings and ground‑truth rankings degrades from ρ= 1.000 at 100% coverage to ρ= 0.809 at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two‑parameter logistic (2PL) Item Response Theory (IRT) model maintains ρ\geq 0.996 across all conditions. A 150‑condition grid sweep over sparsity S \in [0, 0.70] and difficulty gap D \in [0.5, 5.0] confirms that ranking error forms a failure surface with a strong S × D interaction (γ_3 = +0.20, t = 13.05), while IRT maintains ρ\geq 0.993 throughout. We discuss implications for Physical AI benchmarking, where evaluation matrices are often incomplete and difficulty gaps are extreme.

Authors:Elias B. Krey, Nils Neukirch, Nils Strodthoff
Title: FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry
Abstract:
Intermediate feature representations represent the backbone for the expressivity and adaptability of deep neural networks. However, their geometric structure remains poorly understood. In this submission, we provide indirect insights into this matter by applying a broad selection of manipulations in input space, ranging from geometric and photometric transformations to local masking and semantic manipulations using generative image editing models, and assess the feasibility of learning a mapping in the feature space, mapping from the original to the manipulated feature map. To this end, we devise different types of mappings, from linear to non‑linear and local to global mappings and assess both the reconstruction quality of the mapping as well as the semantic content of the mapped representations. We demonstrate the feasibility of learning such mappings for all considered transformations. While global (transformer) models that operate on the full feature map often achieve best results, we show that the same can be achieved with a shared linear model operating on a single feature vector typically with very little degradation in reconstruction quality, even for highly non‑trivial semantic manipulations. We analyze the corresponding mappings across different feature layers and characterize them according to dominance of weight vs. bias and the effective rank of the linear transformations. These results provide hints for the hypothesis that the feature space is to a first degree of approximation organized in linear structures. From a broader perspective, the study demonstrates that generative image editing models might open the door to a deeper understanding of the feature space through input manipulation.

Authors:Jonas Petersen, Gian-Alessandro Lombardi, Riccardo Maggioni, Camilla Mazzoleni, Federico Martelli, Philipp Petersen
Title: HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series
Abstract:
Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is scarce because such events are rare and costly to annotate. We introduce HEPA (Horizon‑conditioned Event Predictive Architecture), built on two key principles. First, a causal Transformer encoder is pretrained via a Joint‑Embedding Predictive Architecture (JEPA): a horizon‑conditioned predictor learns to forecast future representations rather than future values, forcing the encoder to capture predictable temporal dynamics from unlabeled data alone. Second, we freeze the encoder and finetune only the predictor toward the target event, producing a monotonic survival cumulative distribution function (CDF) over horizons. With fixed architecture and optimiser hyperparameters across all benchmarks, HEPA handles water contamination, cyberattack detection, volatility regimes, and eight further event types across 11 domains, exceeding leading time‑series architectures including PatchTST, iTransformer, MAE, and Chronos‑2 on at least 10 of 14 benchmarks, with an order of magnitude fewer tuned parameters and, on lifecycle datasets, an order of magnitude less labeled data.

Authors:Nengneng Yu, Sixian Xiong, Yibo Zhao, Wei Wang, Zaoxing Liu
Title: Enabling Performant and Flexible Model-Internal Observability for LLM Inference
Abstract:
Today's inference‑time workloads increasingly depend on timely access to a model's internal states. We present DMI‑Lib, a high‑speed deep model inspector that treats internal observability as a first‑class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring^2, a GPU‑CPU memory abstraction for capturing and staging tensors, and a policy‑controlled host backend that exports them. DMI‑Lib enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI‑Lib incurs only 0.4%‑‑6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2x‑15x compared to existing baselines with similar observability features. DMI‑Lib is open‑sourced at https://github.com/ProjectDMX/DMI.

Authors:Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren
Title: Backbone-Equated Diffusion OOD via Sparse Internal Snapshots
Abstract:
Fair comparison between diffusion‑based OOD detectors is challenging, as conclusions can vary with backbone choice, corruption parameterization, and test‑time budget. We address this issue through a Mutualized Backbone‑Equated (MBE) protocol that aligns canonical corruption levels and logical test‑time cost across diffusion backbones. Within this setting, we introduce Canonical Feature Snapshots (CFS), a family of detectors that probes a frozen diffusion backbone using only a tiny number of native internal activations at canonical low‑noise levels. On a controlled CIFAR‑scale benchmark, the strongest one‑forward CFS variant is CFS(1x2), while an even smaller decoder‑only variant remains highly competitive. This shows that much of the relative‑OOD signal exposed by frozen diffusion backbones is concentrated in a small number of sparse internal states, rather than requiring full denoising trajectories or high‑capacity downstream heads. We further provide a local diagnostic theory explaining these observations through conditional encoder‑decoder complementarity, diagonal‑score separation, and low‑noise corruption stability. The official implementation is available at https://github.com/RouzAY/cfs‑diffusion‑ood/.

Authors:Taekhyun Park, Yongjae Lee, Dohee Kim, Hyerim Bae
Title: LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
Abstract:
Looped computation shows promise in improving the reasoning‑oriented performance of LLMs by scaling test‑time compute. However, existing approaches typically require either training recurrent models from scratch or applying disruptive retrofits, which involve substantial computational costs and may compromise pretrained capabilities. To address these limitations, we introduce Looped Depth Up‑Scaling (LoopUS), a post‑training framework that converts a standard pretrained LLM into a looped architecture. As a key technical contribution, LoopUS recasts the pretrained LLM into an encoder, a looped reasoning block, and a decoder. It operationalizes this latent‑refinement architecture through four core components: (1) block decomposition, guided by staged representation dynamics; (2) an input‑dependent selective gate to mitigate hidden‑state drift; (3) random deep supervision for memory‑efficient learning over long recursive horizons; and (4) a confidence head for adaptive early exiting. Collectively, these mechanisms transform a standard non‑looped model into a looped form while stabilizing it against both computational bottlenecks and representation collapse. Through stable latent looping, LoopUS improves reasoning‑oriented performance without extending the generated traces or requiring recurrent training from scratch. For more details, see https://thrillcrazyer.github.io/LoopUS

Authors:Yonatan Sverdlov, Benjamin Friedman, Snir Hordan, Nadav Dym
Title: When and How to Canonize: A Generalization Perspective
Abstract:
While invariant architectures are standard for processing symmetric data, there is growing interest in achieving invariance by applying group averaging or canonization to non‑invariant backbones. However, the theoretical generalization properties of these alternative strategies remain poorly understood. We introduce a theoretical framework to analyze the generalization error of these methods by bounding their covering numbers. We establish a rigorous generalization hierarchy: the error bounds of canonized models are at best equal to the error bounds of structurally invariant and group‑averaged models, and at worst equal to the bounds of non‑invariant baselines. Furthermore, we show that there exist optimal canonizations which attain the optimal error bounds, and poor canonizations which attain the non‑invariant error bounds, and that this depends on the regularity of the canonization. Finally, applying this framework to permutation groups in point cloud processing, we rigorously prove that the covering number of lexicographical sorting grows exponentially with point cloud dimension, whereas Hilbert curve canonization guarantees polynomial growth. This provides the first formal theoretical justification for the empirical success of Hilbert curve serialization in state‑of‑the‑art point cloud architectures. We conclude with experiments that support our theoretical claims. Code is available at https://github.com/yonatansverdlov/Canonization

Authors:Yutszyuk Wong, Wentai Wu, Yuen-Ying Yeung, Weiwei Lin
Title: Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation
Abstract:
Log anomaly detection is a critical task for system operations and security assurance. However, in networked systems at scale, log data are generated at massive scale while instance‑level annotations are prohibitively expensive, posing great difficulties to fine‑grained anomaly localization. To address this challenge, we propose LogMILP (Log anomaly localization based on Multi‑Instance Learning enhanced by prototypes and Perturbation), a weakly supervised framework that enables both bag‑level anomaly detection and instance‑level anomaly localization using only bag‑level labels. Our method guides the model to pinpoint the critical log entries using prototype‑guided structural modeling with counterfactual perturbation consistency regularization, thereby improving localization reliability and interpretability under coarse‑grained supervision. Experimental results on three public datasets demonstrate that LogMILP achieves competitive detection performance while yielding significantly more reliable instance‑level localization. Our code is open‑sourced at https://github.com/YUK1207/LogMILP.

Authors:Hangzhan Jin, Tianwei Ni, Lu Li, Pierre-Luc Bacon, Mohammad Hamdaqa, Doina Precup
Title: Rotation-Preserving Supervised Fine-Tuning
Abstract:
Supervised fine‑tuning (SFT) improves in‑domain performance but can degrade out‑of‑domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss‑sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher‑sensitive directions, which we call Rotation‑Preserving Supervised Fine‑Tuning (RPSFT). RPSFT penalizes changes in the projected top‑k singular‑vector block of each pretrained weight matrix, limiting unnecessary rotation while preserving task adaptation. Across model families and sizes trained on math reasoning data, RPSFT improves the in‑domain/OOD trade‑off over standard SFT and strong SFT baselines, better preserves pretrained representations, and provides stronger initializations for downstream RL fine‑tuning. Code is available at \hrefhttps://github.com/jinhangzhan/RPSFT.githttps://github.com/jinhangzhan/RPSFT.

Authors:Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu
Title: DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
Abstract:
While Mixture‑of‑Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory‑access bottlenecks, which hinder efficient end‑side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU‑based routing enhanced by learnable expert‑wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed‑expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non‑gated MLP experts with ReLU‑based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00× speedup on real hardware compared with dense inference. Codes and checkpoints are all available at https://github.com/thunlp/DECO.

Authors:Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng
Title: Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
Abstract:
Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized into the policy, eventually leading to zero‑skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non‑monotonic, task‑ and stage‑dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill's marginal external contribution through leave‑one‑skill‑out validation, then applies three lifecycle operations: retaining high‑value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1% points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill‑based agentic RL.

Authors:Yixuan Yang, Mehak Arora, Ryan Zhang, Baraa Abed, Junseob Kim, Tilendra Choudhary, Md Hassanuzzaman, Kevin Zhu, Ayman Ali, Chengkun Yang, Alasdair Edward Gent, Victor Moas, Rishikesan Kamaleswaran
Title: Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
Abstract:
We present Clin‑JEPA, a multi‑phase co‑training framework for joint‑embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent‑space planning in robotics and high‑quality representation learning in vision, but extending the paradigm to EHR data ‑‑ to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk‑prediction tasks without per‑task fine‑tuning ‑‑ remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I‑JEPA, V‑JEPA) or train it on a frozen pretrained encoder (V‑JEPA 2‑AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co‑training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co‑training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin‑JEPA's five‑phase pretraining curriculum ‑‑ predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization ‑‑ addresses each failure mode by phase, stably co‑training a Qwen3‑8B‑based encoder and a 92M‑parameter latent trajectory predictor. On MIMIC‑IV ICU data, three independent evaluations support the framework: (1) latent \ell_1 rollout drift uniquely converges (‑15.7%) over 48‑hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating‑patient cohorts displace 4.83× further than stable patients in latent space, vs \leq2.62× for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi‑task downstream evaluation. Clin‑JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).

Authors:Gabriel Garcia
Title: The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
Abstract:
Corruption studies, the standard tool for evaluating chain‑of‑thought (CoT) faithfulness, infer which steps are ``computationally important'' from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure \emphanswer placement rather than where intermediate computation is carried out. Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about 19× for Qwen~2.5‑3B (N=300, p=0.022). Conflicting‑answer prompts, which contain correct reasoning but a wrong explicit final answer, drive accuracy to zero or near‑zero at 7B across five open‑weight model families; wrong‑answer following is strong at 3B‑‑7B and attenuates sharply at larger scales. Replications on MATH, within‑stable comparisons at 7B, and suffix‑free chains show the same pattern in different guises: corruption sensitivity tracks the location of explicit answer text, not a fixed computational depth in the reasoning. Generation‑time probes indicate that final answers are rarely early‑determined during generation (<5% early commitment), yet consumption‑time behavior systematically follows explicit answer text. The confound is therefore largely a readout effect when the chain is consumed. We propose a three‑prerequisite protocol (question‑only control, format characterization, and an all‑position sweep) as a practical minimum for future corruption‑based faithfulness studies.

Authors:Daniel Goldstein, Eugene Cheah
Title: Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory
Abstract:
We present Key‑Value Means ("KVM"), a novel block‑recurrence for attention that can accommodate either fixed‑size or growing state. Equipping a strong transformer baseline with fixed‑size KVM attention layers yields a strong O(N) chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long‑context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk‑wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk‑wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV‑cache memory, and allowing a continuous range of choices of prefill time complexity between O(N) and O(N^2). It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/recursal/KVM‑paper and trained models at https://huggingface.co/collections/recursal/key‑value‑means under the Apache 2.0 license.

Authors:Zahra Asadi, Haeseung Jeon, Sohyun Han, Md Mahmuduzzaman Kamol, Se Eun Oh, Mohammad Saidur Rahman
Title: FreeMOCA: Memory-Free Continual Learning for Malicious Code Analysis
Abstract:
As over 200 million new malware samples are identified each year, antivirus systems must continuously adapt to the evolving threat landscape. However, retraining solely on new samples leads to catastrophic forgetting and exploitable blind spots, while retraining on the entire dataset incurs substantial computational cost. We propose FreeMOCA, a memory‑ and compute‑efficient continual learning framework for malicious code analysis that preserves prior knowledge via adaptive layer‑wise interpolation between consecutive task updates, leveraging the fact that warm‑started task optima are connected by low‑loss paths in parameter space. We evaluate FreeMOCA in both class‑incremental (Class‑IL) and domain‑incremental (Domain‑IL) settings on large‑scale Windows (EMBER) and Android (AZ) malware benchmarks. FreeMOCA achieves substantial gains in Class‑IL, outperforming 11 baselines on both EMBER and AZ benchmarks. It also significantly reduces forgetting, achieving the best retention across baselines, and improving accuracy by up to 42% and 37% on EMBER and AZ, respectively. These results demonstrate that warm‑started interpolation in parameter space provides a scalable and effective alternative to replay for continual malware detection. Code is available at: https://github.com/IQSeC‑Lab/FreeMOCA.

Authors:Lennon J. Shikhman
Title: HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations
Abstract:
Neural operators provide fast surrogate models for time‑dependent partial differential equations, but their standard autoregressive use usually assumes that the instantaneous field u(t,\cdot) is a complete state. This assumption fails for delay equations, distributed‑memory systems, and other non‑Markovian dynamics: two trajectories may agree at time t and nevertheless have different futures because their histories differ. We introduce the History‑Space Fourier Neural Operator (HS‑FNO), a neural operator for delay and memory‑driven PDEs formulated on the lifted state u_t(θ,x)=u(t+θ,x), θ\in[‑τ,0]. The key computational step is to decompose one history‑state update into a learned predictor for the newly exposed future slice and an exact shift‑append transport for the portion of the history window already known from the previous state. This avoids learning deterministic history coordinates, reduces the learned output dimension, and enforces the natural discrete history update. We test HS‑FNO on five benchmark families covering delayed reaction‑‑diffusion, spatial epidemiology, nonlocal neural‑field dynamics, delayed waves, and distributed‑memory closures. Across ten random seeds, HS‑FNO attains the lowest aggregate one‑step, history‑space, and rollout errors among the principal baselines. The largest gain occurs in autoregressive prediction, where aggregate rollout error decreases from 0.241, 0.188, and 0.185 for current‑state, lag‑stack, and unconstrained history‑to‑history operators, respectively, to 0.094. The same model uses fewer parameters than unconstrained history prediction. These results indicate that enforcing the discrete shift structure of history‑state evolution is an effective inductive bias for non‑Markovian PDE surrogate modeling.

Authors:Dong Yang, Yiyi Cai, Haoyu Zhang, Yuki Saito, Hiroshi Saruwatari
Title: Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
Abstract:
Metric‑induced discrete flow matching (MI‑DFM) exploits token‑latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite‑step path‑tracking error from its first‑order continuous‑time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic‑optimal scheduler for prescribed scalar‑parameterized probability paths, and instantiate it for MI‑DFM as a training‑free numerical schedule that traverses the path at constant Fisher‑Rao speed. Second, we introduce a finite‑step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec‑based zero‑shot text‑to‑speech (TTS). Under controlled comparisons with a unified architecture and large‑scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state‑of‑the‑art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject

Authors:Junyu Lu, Shashwath Suresh, Hao Liu, Qi Hong, Qing Wang
Title: Split CNN Inference on Networked Microcontrollers
Abstract:
Running deep neural networks on microcontroller units (MCUs) is severely constrained by limited memory resources. While TinyML techniques reduce model size and computation, they often fail in practice due to excessive peak Random Access Memory (RAM) usage during inference, dominated by intermediate activations. As a result, many models remain infeasible on standalone MCUs. In this work, we present a fine‑grained split inference system for networked MCUs that enables collaborative inference of Convolutional Neural Networks (CNN) models across multiple devices. Our key insight is that breaking the memory bottleneck requires splitting inference at sub‑layer granularity rather than at layer boundaries. We reinterpret pre‑trained models to enable kernel‑wise and neuron‑wise partitioning, and distribute both model parameters and intermediate activations across multiple MCUs. A lightweight, resource‑aware coordinator orchestrates the inference across MCU devices with heterogeneous resources. We implement the proposed system on a real testbed and evaluate it on up to 8 MCUs using MobileNetV2, a representative CNN model. Our experimental results show that CNN models infeasible on a single MCU can be executed across networked MCUs, reducing the per‑MCU peak RAM usage while maintaining the practical end‑to‑end inference latency. All the source code of this work can be found here: https://github.com/shashsuresh/split‑inference‑on‑MCUs.

Authors:Jiyeon Kim, Byungju Lee, Won-Yong Shin
Title: Teaching Molecular Dynamics to a Non-Autoregressive Ionic Transport Predictor
Abstract:
Unlike most static material properties widely studied in the machine learning literature, ionic transport properties are inherently dynamic, making their fast and accurate prediction from static atomic structures challenging. The current standard approach, molecular dynamics (MD) simulations, suffers from prohibitively high computational cost. Recent autoregressive learning‑based MD acceleration methods requiring sequential inference remain slow and prone to error accumulation; in contrast, existing non‑autoregressive material property prediction models are less accurate because they fail to exploit dynamics. Moreover, existing methods typically benefit from datasets either with or without atomic trajectories, but not both. To overcome these limitations, we propose a non‑autoregressive learning framework based on auxiliary modality learning, which treats atomic trajectories as an auxiliary modality during training but does not require them at inference. This enables the predictor to learn dynamics without sequential inference while benefiting from both types of datasets. As a result, our framework achieves over 200 times speedup compared to autoregressive models on the dataset with atomic trajectories while substantially reducing prediction error relative to non‑autoregressive benchmarks across both types of datasets. Our code is available at https://github.com/jykim‑git/MD.

Authors:Boxuan Zhang, Jianing Zhu, Qifan Wang, Jiang Liu, Ruixiang Tang
Title: Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
Abstract:
Recent generative models can produce images that appear highly realistic, raising challenges in distinguishing real and AI‑generated images. Yet existing detectors based on pre‑trained feature extractors tend to over‑rely on global semantics, limiting sensitivity to the critical micro‑defects. In this work, we propose Micro‑Defects expose Macro‑Fakes (MDMF), a local distribution‑aware detection framework that amplifies micro‑scale statistical irregularities into macro‑level distributional discrepancies. To avoid localized forensic cues being diluted by plain aggregation, we introduce a learnable Patch Forensic Signature that projects semantic patch embeddings into a compact forensic latent space. We then use Maximum Mean Discrepancy (MMD) to quantify distributional discrepancies between generated and real images. Our theory‑grounded analysis shows that patch‑wise modeling yields provably larger discrepancies when localized forensic signals are present in generated images, enabling more reliable separation from real images. Extensive experiments demonstrate that MDMF consistently outperforms baseline detectors across multiple benchmarks, validating its general effectiveness. Project page: https://zbox1005.github.io/MDMF‑project/

Authors:Muhammed Ustaomeroglu, Guannan Qu
Title: Towards Effective Theory of LLMs: A Representation Learning Approach
Abstract:
We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden‑state trajectories using a BYOL/JEPA‑style self‑supervised objective, coarse‑graining activations into macrovariables that preserve higher‑level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal "mental‑state" trajectories of reasoning, capture high‑level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high‑level, dynamically meaningful variables that support interpretation, prediction, and intervention.

Authors:Jiyeon Kim, Youngjoon Hong, Won-Yong Shin
Title: Semi-Supervised Neural Super-Resolution for Mesh-Based Simulations
Abstract:
Mesh‑based simulations provide high‑fidelity solutions to partial differential equations (PDEs), but achieving such accuracy typically requires fine meshes, leading to substantial computational overhead. Super‑resolution techniques aim to mitigate this cost by reconstructing high‑resolution (HR), high‑fidelity solutions from low‑cost, low‑resolution (LR) counterparts. However, training neural networks for super‑resolution often demands large amounts of expensive HR supervision data. To address this challenge, we propose SuperMeshNet, an HR data‑efficient super‑resolution framework for mesh‑based simulations aided by message passing neural networks (MPNNs). At its core, SuperMeshNet introduces complementary learning, a semi‑supervised approach that effectively leverages both 1) a small amount of paired LR‑HR data and 2) abundant unpaired LR data via two jointly trained, complementary MPNN‑based models. Additionally, our model is enriched by inductive biases, which are empirically shown to further improve super‑resolution performance. Extensive experiments demonstrate that SuperMeshNet requires 90% less HR data to achieve even lower root mean square error (RMSE) than that of the fully supervised benchmark without the inductive biases. The source code and datasets are available at https://github.com/jykim‑git/SuperMeshNet.git.

Authors:Kai Zhao, Dongliang Nie, Yuchen Lin, Zhehan Luo, Yixiao Gu, Deng-Ping Fan, Dan Zeng
Title: Sub-JEPA: Subspace Gaussian Regularization for Stable End-to-End World Models
Abstract:
Joint‑Embedding Predictive Architectures (JEPAs) provide a simpleframework for learning world models by predicting future latent representations.However, JEPA training is subject to a bias‑variance tradeoff.Without sufficient structural constraints, excessive representationalvariance causes the model to collapse to trivial solutions.The recent LeWorldModel (LeWM) shows that this issue can be alleviated bysimply constraining latent embeddings with an isotropic Gaussian prior.However, latent representations inherently lie on low‑dimensional manifoldswithin a high‑dimensional ambient space, and enforcing an isotropic Gaussianprior directly in this ambient space introduces an overly strong bias.In this work, we propose ame, which seeks a favorable operatingpoint on the bias‑variance frontier by applying Gaussian constraints inmultiple random subspaces rather than in the originalembedding space.This design relaxes the global constraint while preserving itsanti‑collapse effect, leading to a better balance between trainingstability and representation flexibility.Extensive experiments across fourcontinuous‑control environments demonstrate that consistentlyoutperforms LeWM with very clear margins.Our method is simple yet effective, and serves as a strong baseline for future JEPA‑based world model research.fdefinedeeemodeThe code is available at https://github.com/intcomp/Sub‑JEPA.

Authors:Sohan Venkatesh
Title: Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
Abstract:
Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near‑perfect accuracy at every post‑embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens and tokenization artifacts account for none of the failure. Instead, a format‑triggered multi‑layer perceptron (MLP) block overwrites the correctly‑encoded count with a fixed wrong answer at roughly 88‑‑93,% network depth. This prior fires for repeated word‑tokens in space‑separated list format and is absent for repeated digit‑tokens. It is suppressed by comma‑separated delimiters in larger models but persists in smaller ones. The finding holds across Llama‑3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing not of representation and the two require different interventions.

Authors:Yibang Li, Bihari Lal Pandey, Ravi Sah, Andi Han, Cyrus Mostajeran, Pratik Jawanpuria, Bamdev Mishra
Title: Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
Abstract:
Muon and related norm‑constrained matrix optimizers have become central to large‑scale learning problems. They are formulated as a linear maximization oracle (LMO) over an ambient matrix‑norm ball in unconstrained Euclidean space. However, these do not generalize cleanly to manifold‑valued parameters such as low‑rank factorizations, orthogonality constraints, or symmetric positive definite (SPD) matrices. Naively restricting the Muon LMO to the tangent space (i) breaks quotient symmetries and (ii) couples the tangent‑space constraint with an ambient norm bound, thereby obstructing closed‑form solutions on various manifolds of interest. We resolve both issues with a single observation: every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space, and the resulting intrinsic norm constrained LMO is symmetry preserving. Building on this, we introduce intrinsic Muon (iMuon), a unified framework that yields closed‑form updates on the fixed‑rank, SPD, Stiefel, and Grassmann manifolds for any unitarily invariant norm, including the spectral, Frobenius, and nuclear norms. We establish convergence guarantees for both deterministic and stochastic iMuon with rate constants that depend only on the manifold dimension. Notably, on the fixed‑rank manifold this constant depends only on the rank, making the rate independent of factor conditioning and removing the runtime factor‑rescaling required by prior work. Experiments on LoRA finetuning of LLMs, image classification, and subspace learning illustrate the efficacy of the proposed approach.

Authors:Yang Zhou, Can Jin, Zihan Dong, Zhepeng Wang, Yanting Yang, Shiyu Zhao, Lei Li, Runxue Bao, Yaochen Xie, Dimitris N. Metaxas
Title: DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
Abstract:
Reinforcement learning improves the reasoning ability of large language models but remains costly and sample‑inefficient, as many rollouts provide weak learning signals. Difficulty‑aware data selection methods attempt to address this by prioritizing moderately difficult prompts, yet our analysis reveals three limitations: difficulty estimates become inaccurate under policy drift, data selection alone yields limited final‑performance gains, and inference efficiency remains largely unchanged. These findings suggest that efficient and effective RL requires more than filtering by difficulty: the policy should learn to solve hard tasks while producing concise responses for easy ones. To this end, we propose Dare, a unified framework that co‑evolves difficulty estimation with the policy via self‑normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. Extensive experiments across multiple models and domains demonstrate that Dare consistently outperforms existing methods in training efficiency, final effectiveness, and inference efficiency, producing more concise responses on easy tasks while improving correctness on hard ones. Code is available at https://github.com/EtaYang10th/DARE.

Authors:Ankit Hemant Lade, Sai Krishna Jasti, Indar Kumar, Aman Chadha
Title: Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)
Abstract:
A Mamba state‑space model trained only for next‑step prediction appears to recover Granger‑causal structure through a simple readout S = |W_out W_in|, with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at p < 10^‑5. We package the protocol used to test that claim ‑‑ standardized synthetic generators (VAR/Lorenz/CauseMe‑style), three intervention semantics (do(X=c), soft‑noise, random‑forcing), edge‑provenance cards on three real datasets, and size‑matched control arms ‑‑ as a reusable falsification benchmark, and walk the claim through it in five stages. The method‑level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe‑style benchmarks, and on Lorenz‑96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) the headline intervention advantage is roughly 60% a sample‑size confound, and the residual disappears under standard do(X=c) interventions, surviving only under a non‑standard random‑forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger ‑‑ the effect is method‑agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.

Authors:Chuning Li, Chris J. Maddison
Title: Predicting Large Model Test Losses with a Noisy Quadratic System
Abstract:
We introduce a predictive model that estimates the pre‑training loss of large models from model size (N), batch size (B) and number of weight updates (K). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla's loss model, a model of the test loss using the batch size and number of tokens, in terms of projecting the loss at extrapolated compute budgets (up to 1000 folds). A natural use of the model is to find optimal N, B, K configurations under explicit and compound resource constraints like time, memory and compute. In our experiments, the model‑selected configurations are close to ground‑truth optimal. Our work advocates for loss prediction as a better alternative to heuristic‑based laws, which are growing in complexity. The implementation is available on https://github.com/chuningxdy/Noisy‑Quadratic‑System.

Authors:Enrique Hernández Noguera, Md Meftahul Ferdaus, Elias Ioup, Mahdi Abdelguerfi, Julian Simeonov
Title: Bridging Spectral Operator Learning and U-Net Hierarchies: SpectraNet for Stable Autoregressive PDE Surrogates
Abstract:
Neural operators for time‑dependent PDEs face a structural tension: spectral architectures (FNO and descendants) inherit exponential rollout‑error growth from their one‑step Lipschitz constant, while hierarchical U‑Net operators trade resolution invariance for multi‑scale detail. We introduce SpectraNet, an autoregressive neural operator that composes truncated spectral convolutions inside a U‑Net hierarchy with a Residual‑Target Spectral Block trained under a Semigroup‑Consistency Loss. The residual‑target parametrization replaces L^T stability blow‑up with linear Tdelta drift, and the spectral path's parameter count is Theta(L w^2 M^2), independent of grid N. Under a single unified protocol against 16 published neural‑operator baselines on Navier‑Stokes nu=1e‑5 at 64x64, SpectraNet reaches test relative L2 = 0.0822 at 2.04M parameters ‑‑ 2.33x fewer than canonical FNO at ~20% lower error ‑‑ and wins five of six rows in a cross‑PDE comparison against FNO (NS at nu in 1e‑4, 1e‑3, PDEBench Shallow‑Water 2D and Diffusion‑Reaction, with the Active‑Matter row going to FNO inside its seed spread). Trained from scratch at native 128^2 under the same protocol, SpectraNet improves to 0.0724 while FNO regresses to 0.3080. Free rollout stays bounded for T=100 where FNO diverges across all 200 test trajectories. On consumer CPU at B=1, SpectraNet runs sub‑200ms while the full‑attention Transformer that wins raw L2 pays ~60x latency; we do not claim to beat that Transformer on raw L2, only to dominate the lightweight (<=5M parameter, sub‑200ms CPU) Pareto frontier. Source code: https://github.com/Enrikkk/spectranet

Authors:Runyao Yu, Julia Lin, Derek W. Bunn, Jochen Stiasny, Wentao Wang, Yujie Chen, Tara Esterl, Peter Palensky, Jochen L. Cremer
Title: A Market-Rule-Informed Neural Network for Efficient Imbalance Electricity Price Forecasting
Abstract:
Accurate and efficient imbalance electricity price forecasting is critical for industrial energy trading systems, especially as battery assets and automated bidding pipelines increasingly participate in balancing markets. However, real‑time forecasting is complicated by nonlinear market‑rule‑based price formation, heterogeneous input signals, and incomplete data availability caused by communication delays, publication lags, and measurement outages. This paper proposes a market‑rule‑informed neural forecasting framework that embeds imbalance price formation rules into the latent space of an expressive neural network. The proposed framework preserves raw signal information while exploiting transparent market‑rule priors. We further analyze operational robustness by removing price‑component information and characterize how forecasting performance scales with input length and forecasting horizon. Experimental results show that the proposed model achieves competitive forecasting performance with substantially fewer trainable parameters and shorter training time than generic deep learning baselines. Experimental results show that the proposed model achieves competitive forecasting performance with substantially fewer trainable parameters and shorter training time than generic deep learning baselines, demonstrating that market‑rule priors and expressive neural networks should be jointly used for accurate and computationally sustainable forecasting in industrial energy trading applications. The implementation is publicly available at https://runyao‑yu.github.io/MRINN/.

Authors:Jiahe Chen, Ziye Ma
Title: Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
Abstract:
Zeroth‑order (ZO) optimization has become increasingly popular and important in fine‑tuning large language models (LLMs), especially on edge devices due to its ability to adjust the model to local data without the need for memory‑intensive back‑propagation. Recent works try to reduce ZO variance through low‑dimensional subspace search, but subspace restriction alone leaves key optimization geometry under‑exploited, motivating additional acceleration. In this work, we focus on the hidden layer training problem in which spectral optimizers like Muon outperform AdamW due to its ability to exploit weak spectral directions by orthogonalization. However, we have discovered that unlike in the first‑order setting, full orthogonalization works poorly in the ZO setting since the gradient estimates are highly noisy and unreliable. To address this issue, we propose a key approach we call partial orthogonalization. To do so, we replace the iconic Newton‑Schulz procedure in Muon with the faster, more concentrated power‑iteration method so that it only amplifies dominant spectral directions. Furthermore, to improve the efficiency and generalization of the algorithm, we adopted a streaming variant of power‑iteration that requires low variance in gradients, which was achieved through constraining our search inside a subspace obtained through the projection of momentum, echoing recent advances. Experiments on LLM fine‑tuning show that our method can achieve from 1.5x to 4x the convergence speed of ZO‑Muon, the current SOTA algorithm, across SuperGlue datasets in the OPT‑13B model. Across different models, we also reach competitive final accuracies with less time in most cases compared with strong ZO baselines such as MeZO, LOZO and ZO‑Muon. Code is available at https://github.com/MOFA‑LAB/ZO‑MOPI.git.

Authors:Dongcheng Zhang, Yi Zhang, Yuxin Chen, An Zhang, Xiang Wang, Chaochao Lu
Title: Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
Abstract:
Large Reasoning Models possess remarkable capabilities for self‑correction in general domain; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine‑tuning the model on expert data including reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data which inevitably deviate from model's dynamic, on‑policy reasoning traces, resulting in model hardly covering its vast generation space and learning to recover from its own failures. To bridge this gap, we propose Self‑ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self‑ReSET significantly enhances robustness against adversarial attacks especially out‑of‑distribution (OOD) jailbreak prompts while maintaining general utility, along with efficient data utilization. Further analysis reveals that our method effectively fosters self‑recovery patterns, enabling models to better identify and recover from unsafe intermediate error states back to benign paths. Our codes and data are available at https://github.com/Ing1024/Self‑ReSET.

Authors:Xinyu Li, Ronghui Mu, Lin Li, Tianjin Huang, Gaojie Jin
Title: OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
Abstract:
Large Language Models (LLMs) are increasingly deployed as autonomous agents that execute tool‑augmented, multi‑step tasks, where latency is a critical factor for real‑world applications. Yet an overlooked threat is Reasoning‑Level Denial‑of‑Service (R‑DoS), in which an attacker preserves task correctness but degrades availability by inflating an agent's reasoning depth or tool‑use budget. We introduce OTora, the first unified, two‑stage red‑teaming framework for instantiating R‑DoS attacks. Stage I optimizes an adversarial trigger that induces targeted tool invocations using insertion‑aware scoring and dynamic target co‑evolution, supporting both black‑box and white‑box settings. Stage II generates agent‑aware reasoning payloads via an ICL‑guided genetic search that amplifies overthinking while maintaining correct task outcomes. Across WebShop, Email, and OS agents built on multiple backbone models such as LLaMA‑70B and GPT‑OSS‑120B, OTora achieves up to 10 times increases in reasoning tokens and order‑of‑magnitude latency slowdowns, all while preserving near‑baseline task accuracy. Finally, we discuss mitigation strategies for detecting and constraining abnormal reasoning and latency spikes. The code is available at https://github.com/llm2409/OTora.

Authors:Shota Fujikawa, Issei Sato
Title: Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection
Abstract:
Hallucination detection has become increasingly important for improving the reliability of large language models (LLMs). Recently, hybrid approaches such as HaMI, which combine semantic consistency with internal model states via Multiple Instance Learning (MIL), have achieved state‑of‑the‑art performance. However, these methods incur substantial computational overhead due to repeated sampling and costly semantic similarity computations. In this work, we first provide a theoretical analysis of HaMI in terms of decision margins, revealing that scaling internal states with semantic consistency leads to an enlarged decision margin. Motivated by this insight, we revisit classical sentence classification models from a margin enlargement perspective, aggregating token‑level features via max pooling and directly estimating sentence scores using a lightweight MLP. Without requiring semantic consistency computations, our approach achieves substantial efficiency improvements while maintaining competitive performance with state‑of‑the‑art baselines through adaptive aggregation of internal feature representations. Code is available at https://github.com/FUJI1229/Hallucination_Detection.

Authors:Yulang Chen, Haoxuan Peng, Jinyan Liu, Zichen Wen, Dongrui Liu, Linfeng Zhang
Title: AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems
Abstract:
Large Language Model‑based Multi‑Agent Systems (MAS) have demonstrated remarkable capabilities in complex tasks. However, manually designing optimal communication topologies is labor‑intensive, while automated expansion methods often result in bloated structures with redundant agents, leading to excessive token consumption. To address this problem, we introduce AgentSlimming, a plug‑and‑play compression framework for graph‑structured multi‑agent workflows. Motivated by pruning and quantization in neural networks, AgentSlimming compresses workflows by first estimating the importance score of each agent with a hybrid mechanism, and then removes redundant agents or replaces them with low‑cost ones, where each operation is validated using a baseline‑anchored acceptance rule to prevent performance collapse. Experiments show that AgentSlimming reduces average token cost by up to 78.9% with negligible performance degradation, and sometimes even improves accuracy, achieving a strong Pareto‑optimal trade‑off between cost and quality. Our code is publicly available at https://github.com/CitrusYL/AgentSlimming

Authors:Renjie Gu, Jiazhen Du, Yihua Zhang, Sijia Liu
Title: Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning
Abstract:
Unlearning in large language models (LLMs) aims to remove harmful training data while preserving overall utility. However, we find that existing methods often hallucinate, generate abnormal token sequences, or behave inconsistently, raising safety and trust concerns. According to prior literature on LLM honesty, such behaviors are often associated with dishonesty. This motivates us to investigate the notion of honesty in the context of model unlearning. We propose a formal definition of unlearning honesty, which includes: (1) preserving both utility and honesty on retained knowledge, and (2) ensuring effective forgetting while encouraging the model to acknowledge its limitations and respond consistently to questions related to forgotten knowledge. To systematically evaluate the honesty of unlearning, we introduce a suite of metrics that cover utility, honesty on the retained set, effectiveness of forgetting, rejection rate and refusal stability in Q&A and MCQ settings. Evaluating 9 methods across 3 mainstream families shows that all current methods fail to meet these standards. After experimental and theoretical analyses, we present ReVa, a representation‑alignment procedure that fine‑tunes feature‑randomized unlearned models to better acknowledge forgotten knowledge. On Q&A tasks from the forget set, ReVa achieves the highest rejection rate after two rounds of interaction, nearly doubling the performance of the second‑best method. Remarkably, It also improves honesty on the retained set. We release our data and code at https://github.com/renjiegu.

Authors:Vladimir Iglovikov
Title: Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders
Abstract:
JPEG decode is routine ML infrastructure, but Python decoder choices are often justified by single‑process, single‑thread microbenchmarks. We audit this evaluation assumption with twelve Python‑accessible JPEG decode paths on five matched 16 vCPU Google Cloud CPUs: Intel Emerald Rapids, AMD Zen 4, AMD Zen 5, ARM Neoverse V2, and ARM Neoverse N1. ImageNet validation is the workload, not a new dataset contribution: each run decodes the full 50,000‑image split from memory and reports single‑thread throughput for all decoders, PyTorch DataLoader throughput for eligible decoders at worker counts 0,2,4,8, and decoder skip behavior. The evaluation protocol changes the supported conclusion. On Neoverse V2, imageio is ninth in single‑thread throughput yet lands in the top DataLoader tier with torchvision; on Zen 4, torchvision rises from seventh single‑thread to the top measured DataLoader tier; on Neoverse N1, imagecodecs is the single‑thread leader but fourth at peak DataLoader throughput. We also find that worker‑count conclusions differ between Zen 4 and Zen 5, TensorFlow has a large single‑thread ARM penalty, and strict libjpeg‑turbo‑family wrappers reject the same rare ImageNet JPEG. For PyTorch DataLoader workloads, torchvision and simplejpeg form the strongest measured zero‑skip tier: torchvision has the highest mean normalized throughput, while simplejpeg has the highest minimum. OpenCV remains a robust general‑purpose fallback above 90% of the platform‑local winner on every tested CPU. We release raw JSON, generated tables/figures, and an executable local/cloud benchmark framework.

Authors:Weidong Zheng, Kongyang Chen, Yuanwei Guo, Yatie Xiao
Title: Classification-Head Bias in Class-Level Machine Unlearning: Diagnosis, Mitigation, and Evaluation
Abstract:
Class‑level machine unlearning aims to remove the influence of specified classes while preserving model utility on retained classes. Existing methods are commonly evaluated by retain‑set accuracy, forget‑set accuracy, and unlearning time, but these metrics provide limited insight into how forgetting is achieved internally. In this paper, we reveal a bias‑dominated shortcut in class‑level unlearning: the prediction of forgotten classes can be suppressed by decreasing the corresponding bias terms in the final classification head. We first analyze the gradient dynamics of classification‑head biases under softmax cross‑entropy training, explaining why retain‑set‑only optimization tends to reduce the biases of absent classes. Based on this observation, we introduce BiasShift as a diagnostic baseline, showing that simple bias manipulation can satisfy conventional unlearning metrics while leaving abnormal bias patterns that reveal forgotten labels. To mitigate excessive forgotten‑class bias suppression, we propose two bias‑aware mechanisms, namely Two‑Stage Bias Gradient Reversal Mechanism (TS‑BGRM) and Lower‑Bound Hinge Regularization (LB‑HR). We further introduce three bias‑oriented metrics, including Bias Stability Coefficient (BSC), Median Bias Gap (MBG), and Minimal Bias Score (MBS), to quantify bias dependence and potential leakage. Experiments on CIFAR‑10, CIFAR‑100, and Tiny‑ImageNet demonstrate that the proposed methods maintain competitive unlearning performance while producing more stable bias distributions. We have released our code at https://github.com/zwd2024/Beyond‑the‑Shadow‑of‑Bias‑From‑Classification‑Head‑Bias‑to‑Parameter‑Redistribution.

Authors:Jiaming Liang, Chi-Man Pun, Weisi Lin, Greta Seng Peng Mok
Title: Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression
Abstract:
Learned image compression (LIC) integrates deep neural networks (DNNs) to map high‑dimensional images into compact latent representations, reducing redundancy and achieving superior rate‑distortion (RD) performance in benign settings. Unfortunately, due to inherent vulnerabilities in DNNs, LIC systems are susceptible to adversarial perturbations that lead to downstream deterioration, compression rate degradation, untargeted distortion, and both local semantic manipulation (LSM) and low‑resolution (3×28×28) global semantic manipulation (GSM). However, high‑resolution GSM remains unexplored due to its intractability. Notably, the existing project gradient descent (PGD) method achieves near‑perfect white‑box attacks for classification, segmentation, and other tasks, yet fails to generalize to high‑resolution GSM. Our theoretical and empirical analyses reveal that well‑performing GSM drives adversarial examples from the Identity Region to the Amplification Region through the Lazying‑Oscillating‑Refining stages. General \ell_\infty‑bounded attacks fail on high‑resolution GSM because their step‑size schedules cannot accommodate both the Oscillating and Refining stages. Based on this, we propose the Periodic Geometric Decay schedule that enables \ell_\infty‑bounded high‑resolution GSM. To verify our approach, we integrate it with PGD, yielding a minimal variant, PGD^2‑GSM. Extensive experiments on the Kodak (3×768×512) demonstrate that our PGD^2‑GSM is the first to stably achieve high‑resolution GSM, thereby exposing a novel threat to LIC systems. Code is available at https://github.com/chinaliangjiaming/PGD2‑GSM.

Authors:Chengcheng Sun, Chenhao Li, Xiang Lin, Tianji Zheng, Fanrong Meng, Xiaobin Rui, Zhixiao Wang
Title: Attention-based graph neural networks: a survey
Abstract:
Graph neural networks (GNNs) aim to learn well‑trained representations in a lower‑dimension space for downstream tasks while preserving the topological structures. In recent years, attention mechanism, which is brilliant in the fields of natural language processing and computer vision, is introduced to GNNs to adaptively select the discriminative features and automatically filter the noisy information. To the best of our knowledge, due to the fast‑paced advances in this domain, a systematic overview of attention‑based GNNs is still missing. To fill this gap, this paper aims to provide a comprehensive survey on recent advances in attention‑based GNNs. Firstly, we propose a novel two‑level taxonomy for attention‑based GNNs from the perspective of development history and architectural perspectives. Specifically, the upper level reveals the three developmental stages of attention‑based GNNs, including graph recurrent attention networks, graph attention networks, and graph transformers. The lower level focuses on various typical architectures of each stage. Secondly, we review these attention‑based methods following the proposed taxonomy in detail and summarize the advantages and disadvantages of various models. A model characteristics table is also provided for a more comprehensive comparison. Thirdly, we share our thoughts on some open issues and future directions of attention‑based GNNs. We hope this survey will provide researchers with an up‑to‑date reference regarding applications of attention‑based GNNs. In addition, to cope with the rapid development in this field, we intend to share the relevant latest papers as an open resource at https://github.com/sunxiaobei/awesome‑attention‑based‑gnns.

Authors:Naoki Masuyama, Yusuke Nojima, Stefan Wermter, Yuichiro Toda, Hisao Ishibuchi, Chu Kiong Loo
Title: PHIDA: Persistence-Guided Node-to-Cluster Mapping for Online Clustering
Abstract:
Online clustering methods that adaptively create and update nodes as data arrive often make node learning explicit, whereas the mapping from the learned node state to output clusters often remains implicit or simplified. Implicit mappings make output clusters sensitive to weak graph bridges or local relations based on distance in the graph over learned nodes, leaving no explicit constraint on which node groups remain intact during mapping. This paper addresses this gap by proposing PHIDA, a persistence‑guided node‑to‑cluster mapping method for online clustering with learned nodes. PHIDA implements this mapping within Adaptive Resonance Theory (ART)‑based online clustering by combining Inverse‑Distance ART (IDA) node learning with node‑to‑cluster mapping constrained by Persistent Homology (PH). Experiments on 24 benchmark datasets show that PHIDA achieves the best average ranks in stationary comparisons that include the recent stationary‑only clustering methods, while also improving aggregate performance in the nonstationary setting over the evaluated online methods that adaptively create and update nodes. Ablations and comparisons with conventional node‑to‑cluster mappings indicate that the observed gains are associated with PH‑constrained mapping that preserves raw PH components, together with the use of the PH component view during node learning. Source code is available at https://github.com/Masuyama‑lab/PHIDA

Authors:Michael Groom, Victor-Alexandru Darvariu, Lars Kunze, James Wilson, Nick Hawes
Title: Quantile-Coupled Flow Matching for Distributional Reinforcement Learning
Abstract:
Unlike standard expected‑return Reinforcement Learning (RL), Distributional RL (DRL) models the full return distribution, making it better‑suited for uncertainty‑aware and risk‑sensitive decision‑making. Conditional Flow Matching (CFM) critics have recently attracted attention for modelling continuous, multi‑modal return distributions. Despite this interest, there remains a substantial metric mismatch: DRL theory relies on the distributional Bellman operator being contractive in the p‑Wasserstein distance, yet existing CFM critics are trained with arbitrary source‑target couplings, so their flow‑matching losses are not Wasserstein‑aligned surrogates for matching Bellman target return distributions. In this work, we address this mismatch by proposing FlowIQN, a CFM critic that sorts source and Bellman target samples within each mini‑batch to approximate the monotone optimal transport coupling, replacing arbitrary pairings with quantile‑aligned flow paths. We prove that the loss of our quantile‑coupled CFM critic yields a Wasserstein‑aligned approximate projection compatible with the foundations of DRL. To our knowledge, FlowIQN is the first flow‑matching distributional critic with an explicit Wasserstein‑aligned projection guarantee. We further extend FlowIQN with shortcut models for efficient inference. Empirical results show that FlowIQN improves Wasserstein return‑distribution accuracy over other CFM critics. It also yields competitive performance on offline RL benchmarks across multiple policy extraction methods, providing a theoretically grounded CFM critic that is readily compatible with DRL pipelines. Code: https://github.com/ori‑goals/flowIQN.

Authors:Wenhao Wu, Zishan Shao, Kangning Cui, Jinhee Kim, Yixiao Wang, Hancheng Ye, Danyang Zhuo, Yiran Chen
Title: FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast
Abstract:
SVD‑based Low‑rank compression reduces transformer parameters and nominal FLOPs, but these savings often translate poorly into real LLM serving speedups. We show that this gap is largely a runtime problem: factorized checkpoints fragment execution paths, and the resulting overhead differs substantially between prefill and autoregressive decode. We present FlashSVD v1.5, a unified inference runtime for serving SVD‑compressed transformers. FlashSVD v1.5 maps diverse public SVD compression families to a common factorized representation and combines phase‑specific kernels with dense‑KV decode, packed MLP execution, and per‑layer CUDA‑graph replay to reorganize the low‑rank serving path into a thin runtime. Across representative decoder‑serving settings, FlashSVD v1.5 achieves up to 2.55x decode and 2.39x end‑to‑end speedup, and it attains 1.48x average decode and 1.44x average end‑to‑end speedup across multiple popular SVD compression families. These results suggest that practical low‑rank acceleration requires runtime co‑design, not compression algorithms alone. Our code is available at: https://github.com/Zishan‑Shao/FlashSVD.

Authors:Siyu Wu, Yulong Ye, Zezhen Xiang, Pengzhou Chen, Gangda Xiong, Tao Chen
Title: LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems
Abstract:
Large Language Model (LLM) systems have been the frontier of AI in many application domains, leading to new challenges and opportunities for hyperparameter optimization (HPO) for the AutoML community. However, this type of system exhibits an unprecedented compound space of hyperparameter configuration from both the AI and non‑AI components; rich and nonlinear implications from the fidelity factors; and diverse costs of measuring hyperparameter configurations, none of which have been fully captured in existing benchmarks. This paper presents the first (live) benchmark suite and datasets for HPO of real‑world LLM systems, dubbed LLMSYS‑HPOBench, covering data related to the inference objective values of hyperparameter configurations profiled from running the LLM systems. Currently, LLMSYS‑HPOBench contains 364,450 hyperparameter configurations with a dimensionality of 12‑23, 3‑5 dimensions of fidelity factor leading to 932 settings, 3‑9 inference objective metrics, and 2‑10 cost metrics, together with generated logs from measuring the LLM systems. What we seek to advocate is not only a revalidation of the existing HPO algorithms over the frontier LLM systems, but also to provide an evolving platform for the AutoML community to explore new directions of research in this regard. The benchmark suite has been made available at: https://github.com/ideas‑labo/llmsys‑hpobench

Authors:Abdulvahap Mutlu, Şengül Doğan, Türker Tuncer
Title: mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
Abstract:
Manifold‑Constrained Hyper‑Connections (mHC) introduce a stability‑motivated variant of multi stream residual mixing by constraining residual stream mixing matrices to the manifold of doubly stochastic matrices via Sinkhorn‑Knopp projection. In his work, we study whether mHC‑style constrained multi‑stream residual topology transfers effectively to state space model (SSM) language modeling. We implement a static mHC mechanism around an SSM block by expanding the residual stream into multiple parallel streams, aggregating streams into a single SSM input through simplex‑constrained pre‑mixing, scattering the SSM output back to streams through simplex‑constrained post‑mixing, and applying Sinkhorn‑projected residual stream mixing at each layer. We further introduce stream‑specialized adapters that add lightweight stream‑specific capacity through a shared bottleneck with per‑stream scaling, applied both before stream aggregation and after the SSM output prior to scattering. We evaluate baseline single‑stream SSM, static mHC SSM, and mHC SSM with adapters on WikiText‑2 using identical training settings and report checkpoint‑based validation loss, perplexity, throughput, and peak GPU memory. Under the reported fair checkpoint evaluation, static mHC improves validation loss from 6.3507 to 6.2448 and reduces perplexity from 572.91 to 515.35, while mHC with adapters further improves validation loss to 6.1353 and perplexity to 461.88. These gains are accompanied by modest throughput reductions from 1025.52 to 964.81 and 938.90 tokens per second, and increased peak memory from 2365 MB to 2568 MB and 3092 MB. The results suggest that mHC‑inspired constrained multi‑stream residual mixing can yield measurable quality improvements in SSM language models and that stream‑specialized adapter capacity can further enhance performance with predictable efficiency tradeoffs.

Authors:Xincheng Yao, Ruoqi Li, Cheng Chen, Daoxin Zhang, Yi Wu, Yao Hu, Chongyang Zhang
Title: HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process. While in Chain‑of‑Thought (CoT) reasoning, different tokens usually play distinct roles. Therefore, the current RL algorithms lack an effective mechanism to dynamically balance the exploration‑exploitation trade‑off during learning. To this end, we propose Hierarchical Token‑level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes the divide‑and‑conquer idea to hierarchically partition the response tokens into specific functional groups from three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, according to the contributions to exploration or exploitation, we design specialized optimization objectives to facilitate the effective execution of each token's expected functionality. In this way, HTPO can achieve a more balanced exploration‑exploitation trade‑off. Extensive experiments on challenging reasoning benchmarks validate the superiority of our HTPO algorithm, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME'24 and AIME'25, respectively). When scaling test‑time compute, the HTPO‑trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token‑level control method fosters effective exploration without sacrificing exploitation performance. Code will be at https://github.com/xcyao00/HTPO.

Authors:Logan Mann, Ajit Saravanan, Ishan Dave, Shikhar Shiromani, Saadullah Ismail, Yi Xia, Emily Huang
Title: Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
Abstract:
A pervasive intuition holds that vision‑language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention‑Confidence Assumption directly. We instrument three open‑weight VLM families (LLaVA‑1.5, PaliGemma, Qwen2‑VL; 3‑7B parameters) with a unified mechanistic pipeline ‑‑ the VLM Reliability Probe (VRP) ‑‑ that compares attention structure, generation dynamics, and hidden‑state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near‑zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [‑0.034,0.036]; R_pb(H_s,y)=‑0.012, [‑0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top‑30% patch masking drops accuracy by 8.2‑11.3 pp, p<0.001). (ii) Reliability becomes legible later in the computation: a single hidden‑state linear probe reaches AUROC>0.95 on POPE for two of three families, and self‑consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron‑level ablations expose a sharp architectural split with direct monitor‑design implications: late‑fusion LLaVA concentrates reliability in a fragile late bottleneck (‑8.3 pp object‑identification accuracy after top‑5 probe‑neuron ablation), whereas early‑fusion PaliGemma and Qwen2‑VL distribute it widely and absorb destruction of ~50% of their peak‑layer hidden dimension with <=1 pp degradation. The takeaway is narrow but consequential: in 3‑7B VLMs, reliability is read more reliably off hidden‑state geometry, layer‑wise margin formation, and sparse late‑layer circuits than off attention‑map sharpness.

Authors:Farjana Yesmin
Title: FairHealth: An Open-Source Python Library for Trustworthy Healthcare AI in Low-Resource Settings
Abstract:
We present FairHealth, an open‑source Python library that provides a unified, modular framework for trustworthy machine learning in healthcare applications, with particular focus on low‑resource and low‑income country (LMIC) settings such as Bangladesh. FairHealth addresses four critical gaps in existing healthcare AI toolkits: (1) the absence of integrated fairness auditing for biosignals and clinical tabular data; (2) the lack of privacy‑preserving federated learning tools compatible with standard ML workflows; (3) missing explainability tools tailored for low‑bandwidth clinical decision support; and (4) no existing toolkit covering Global South healthcare datasets. Built from five peer‑reviewed research contributions, FairHealth provides six modules covering federated learning with homomorphic encryption (fairhealth.federated), intersectional fairness metrics (fairhealth.fairness), hybrid fuzzy‑SHAP explainability (fairhealth.explain), multilingual dengue triage (fairhealth.lowresource), equitable disaster aid allocation (fairhealth.equity), and public dataset loaders (fairhealth.datasets). All datasets used are publicly available without institutional data use agreements. FairHealth is installable via pip install fairhealth(PyPI: pypi.org/project/fairhealth/) and available at https://github.com/Farjana‑Yesmin/fairhealth.

Authors:Bo Ye, Kai Gan, Tong Wei, Min-Ling Zhang
Title: Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
Abstract:
Generalized Category Discovery (GCD) seeks to identify novel categories from unlabeled data while retaining the classification ability of seen categories. Prior GCD methods commonly leverage transferable representations from pre‑trained models, adapting to downstream datasets via partial fine‑tuning (updating only the final ViT block) and visual prompt tuning (appending learnable vectors to inputs). However, conventional partial fine‑tuning offers limited flexibility, as it fails to adapt the entire model; meanwhile, visual prompt tuning is prone to overfitting, due to its sensitivity to initialization and inherently constrained capacity. To address these limitations, we propose LAGCD, a simple yet effective GCD approach that embeds a residual linear adapter into each ViT block. From the perspective of feature sparsity, we systematically show that non‑linearity in conventional adapters impairs performance, whereas our linear adapter enhances it by enabling more flexible model capacity. We further introduce an auxiliary distribution alignment loss to mitigate the negative impact of biased predictions between seen and novel categories. Extensive experiments on both generic and fine‑grained datasets confirm that LAGCD consistently improves performance over many sophisticated baselines. The source code is available at https://github.com/yebo0216best/LAGCD

Authors:Weicai Yan, Xinhua Ma, Wang Lin, Tao Jin
Title: Text-Guided Multi-Scale Frequency Representation Adaptation
Abstract:
Parameter‑efficient fine‑tuning methods introduce a small number of training parameters, enabling pre‑trained models to adapt rapidly to new data distributions. While these methods have shown promising results, they exhibit notable limitations. First, most existing methods operate in the signal space domain, which results in substantial information redundancy. Second, most existing methods utilize fixed prompts or adaptation layers, failing to fully account for the multi‑scale characteristics of signals. To address these challenges, we propose the Multi‑Scale Frequency Adapter (FreqAdapter), which integrates textual information and performs multi‑scale fine‑tuning of signals in the frequency domain. Additionally, we introduce a multi‑scale adaptation strategy to optimize receptive fields across different frequency ranges, further enhancing the model's representational capacity. Extensive experiments on multimodal models, including CLIP and LLaVA, demonstrate that FreqAdapter significantly improves both performance and efficiency. FreqAdapter improves performance with minimal cost and fast convergence within one epoch. Code is available at https://github.com/Kelvin‑ywc/FreqAdapter.

Authors:Ayoub Agouzoul
Title: Understanding Asynchronous Inference Methods for Vision-Language-Action Models
Abstract:
Vision‑Language‑Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference‑time inpainting (IT‑RTC), training‑time delay simulation (TT‑RTC), future‑state‑aware conditioning (VLASH), and lightweight residual correction (A2C2). Each takes a fundamentally different approach, but they have so far been evaluated independently with different codebases, base policies, and protocols. We present a systematic comparison of these four methods under controlled conditions. We develop two unified codebases that integrate all methods with harmonized library and dataset versions, and we benchmark them on the Kinetix suite with MLPMixer policies and on the LIBERO manipulation benchmark with SmolVLA, sweeping inference delays up to d=20 control steps. A2C2's per‑step residual correction is the most effective method on Kinetix, holding above 90% solve rate up to d=8, and also leads on LIBERO from d=4 onwards. IT‑RTC is competitive at low delays but degrades sharply under long chunks (H=30) and high delays. TT‑RTC is the most robust training‑based method: stable across d_\max choices, generalizes beyond its training delay distribution, and adds zero inference overhead. VLASH exhibits a clear low‑delay vs. high‑delay trade‑off governed by the fine‑tuning delay range [0,d_\max]. Code is available at https://github.com/TheAyos/async‑vla‑inference

Authors:Jonathan Bates
Title: A PyTorch Library of Turing-Complete Neural Networks
Abstract:
We present a PyTorch package that compiles neural networks and their weights from Turing machine descriptions, producing models that exactly simulate the specified machine without any training. Given a transition function and a set of terminal states, the package constructs a model whose forward pass corresponds to one step of the Turing machine. Two architectures are implemented, each realizing a different theoretical result: (1) a transformer with self‑attention, cross‑attention, and feedforward layers based on Wei, Chen, and Ma (2021), and (2) a recurrent network based on Siegelmann and Sontag (1995) that encodes the stack in a Cantor set. We develop the constructions from first principles, showing how ReLU networks implement Boolean circuits (AND, OR, NOT, XOR gates and their composition into DNF formulas and binary adders) and how hard attention implements positional lookup on the tape. The package serves as a concrete, runnable reference for the symbolic‑neural bridge, and as a foundation for future work on the stability of constructed solutions under gradient‑based optimization. Code is available at https://github.com/jonrbates/turing.

Authors:Yuan Fang, Yi Xie, Xuming Ran
Title: HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing
Abstract:
Large language models encode vast factual knowledge that inevitably becomes outdated or incorrect after deployment, yet retraining is costly prohibitive, motivating model editing in lifelong settings that updates targeted behavior without harming the rest of the model. One line of work installs new facts by directly modifying base weights through locate‑then‑edit procedures, but accumulated edits progressively disrupt originally preserved knowledge, even with constraint‑based projections. A complementary line leaves base weights intact and routes edits through external memory, but it faces routing challenges and its performance degrades at scale. We propose HoReN, a codebook‑based parameter‑preserving editor with enhanced routing built on three ideas. First, HoReN wraps a single MLP layer with a discrete key‑value codebook, where each entry is interpreted simultaneously as a knowledge‑memory key and a modern Hopfield stored pattern. Second, both keys and queries are projected onto the unit hypersphere so retrieval is governed by angular similarity, removing magnitude‑driven mismatches between an edit prompt and its rephrasings. Third, the query is refined through damped Hopfield attractor dynamics, so paraphrases relax into the correct stored pattern's basin of attraction while unrelated queries remain undisturbed. HoReN achieves well‑edited performance with consistent gains across diverse benchmarks spanning standard ZsRE, structured WikiBigEdit, and unstructured UnKE evaluations. Moreover, HoReN scales to 50K sequential edits on ZsRE with stable overall performance above 0.9, while prior editors collapse or degrade severely before reaching 10K. Our code is available at https://github.com/ha11ucin8/HoReN.

Authors:Natalia Frumkin, Bokun Wang, Hung-Yueh Chiang, Chi-Chih Chang, Mohamed S. Abdelfattah, Diana Marculescu
Title: DARE: Diffusion Language Model Activation Reuse for Efficient Inference
Abstract:
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to auto‑regressive (AR) models, offering greater expressive capacity and potential for parallel generation and faster inference. However, open‑source dLLMs remain immature, lagging behind AR models in both efficiency and quality. We identify an underexplored property of dLLMs: token‑wise redundancy in bi‑directional self‑attention. Self‑attention activations are highly correlated across tokens, and temporal changes in query representations can predict redundancy in corresponding key, value, and output activations. We introduce DARE, with two complementary mechanisms: DARE‑KV, which reuses cached key‑value (KV) activations, and DARE‑O, which reuses output activations to reduce redundant computation while preserving quality. DARE achieves up to 1.20x per‑layer latency reduction and reuses up to 87% of attention activations, with negligible degradation on reasoning and code‑generation benchmarks. DARE‑KV and DARE‑O incur average performance drops of only 2.0% and 1.2%, respectively. Combined with techniques such as prefix caching and Fast‑dLLM, DARE provides additive gains without retraining. These results establish token‑wise reuse as an effective strategy for improving the efficiency of diffusion‑based LLMs while preserving generation fidelity. Code: https://github.com/enyac‑group/DARE

Authors:Chao Tang, Jianzong Wu, Qingyu Shi, Ye Tian, Aixi Zhang, Hao Jiang, Jiangning Zhang, Yunhai Tong
Title: Towards Customized Multimodal Role-Play
Abstract:
Unified multimodal understanding and generation models enable richer human‑AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To mitigate this gap, we introduce a new task, Customized Multimodal Role‑Play (CMRP). We construct the RoleScape‑20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text‑image interactions. Building on a unified model, we devise UniCharacter, a two‑stage training framework containing Unified Supervised Finetuning (Unified‑SFT) and character‑specific group relative policy optimization (Character‑GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape‑20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross‑modal consistency design and few‑shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next‑generation characterful and immersive interactive agents.

Authors:Peng Liao, Shangsong Liang, Lin Chen, Peijia Zheng
Title: Modular Retrieval-Augmented Generalization for Human Action Recognition
Abstract:
Inertial Measurement Unit (IMU)‑based Human Activity Recognition (HAR) aims to interpret and classify user behaviors from temporal motion signals. Recently, deep learning frameworks have advanced this task by learning and extracting discriminative spatiotemporal representations, significantly improving recognition performance. However, IMU‑based HAR still faces several critical challenges, particularly limited training samples and static knowledge utilization, both of which severely hinder its large‑scale deployment. In this paper, we introduce MoRA, the first Retrieval‑Augmented Module specifically designed for motion series. It can be flexibly integrated into any existing HAR model, enhancing recognition performance while maintaining inference efficiency. To address issues such as information redundancy in retrieval results and rigid fusion strategies, we propose an uncertainty‑adaptive fusion unit within MoRA. This unit leverages previous physical knowledge from IMU signals to dynamically adjust the fusion strategy between original outputs and retrieved information, enabling more robust recognition. Extensive experiments on ten real‑world datasets demonstrate that MoRA significantly improves the performance of existing IMU‑based HAR models, consistently delivering stable and effective gains. The source code of MoRA is available at: https://github.com/liavonpenn/mora.

Authors:Amman Yusuf, Zhejun Jiang, Mijung Park
Title: The Safety-Aware Denoiser for Text Diffusion Models
Abstract:
Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post‑hoc filtering or inference‑time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety‑Aware Denoiser (SAD), a safety‑guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference‑time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

Authors:Drew Dillon, Kasyap Varanasi
Title: Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%
Abstract:
AI coding agents powered by large language models can read codebases and produce functional code, but they routinely violate team‑specific product decisions that are invisible in the source code alone. We introduce a controlled benchmark measuring decision compliance, the rate at which an AI coding agent follows established product, design, and engineering decisions, across 8 realistic software engineering tasks containing 41 weighted decision points. We compare a baseline configuration (Claude Code with codebase access only) against an augmented configuration that adds Brief, a product‑context retrieval system providing spec generation, mid‑build consultation, and retrieval of recorded decisions, persona pain points, customer signals, and competitive intelligence. On identical prompts and the same repository, the augmented configuration achieves 95% decision compliance versus 46% for the baseline, a 49 percentage point improvement. Per‑decision analysis reveals that the baseline achieves 100% compliance on decisions visible in the codebase and 0‑33% on decisions requiring product context, suggesting that product‑context retrieval is a key driver of the improvement. We release the benchmark repository, all 16 pull requests, and scoring harness for independent reproduction.

Authors:Atsushi Nitanda, Dake Bu, Yueming Lyu, Tanya Veeravalli
Title: Slowly Annealed Langevin Dynamics: Theory and Applications to Training-Free Guided Generation
Abstract:
We study Slowly Annealed Langevin Dynamics (SALD), a sampler for tracking a path of moving target distributions and approximating the terminal target through time slowdown. We establish non‑asymptotic convergence guarantees via a KL differential inequality, showing that slowdown improves tracking through contraction of intermediate targets and the complexity of the path. Motivated by training‑free guided generation with pretrained score‑based generative models, we further introduce Velocity‑Aware SALD (VA‑SALD), which explicitly incorporates the underlying marginal distributions of the pretrained model and uses slowdown to correct the additional deviation induced by guidance. This yields a principled framework for training‑free guided generation for diffusion‑based and related generative model families, together with convergence guarantees that clarify the roles of intermediate functional inequalities and guidance bias. Code is available at https://github.com/anitan0925/sald.

Authors:Giacomo Spigler
Title: TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
Abstract:
Active vision ‑‑ where a policy controls its own gaze during manipulation ‑‑ has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active‑vision imitation learning, with two complementary task suites ‑‑ TAVIS‑Head (5 tasks, global search via pan/tilt necks) and TAVIS‑Hands (3 tasks, local occlusion via wrist cameras) ‑‑ on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam‑vs‑fixedcam protocol on identical demonstrations; GALT (Gaze‑Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and π_0 reveal that (i) active‑vision generally helps, but benefits are task‑conditional rather than uniform; (ii) multi‑task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis‑benchmark.

Authors:Ionut-Vlad Modoranu, Mher Safaryan, Dan Alistarh
Title: MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning
Abstract:
With the rise in scale for deep learning models to billions of parameters, the computational cost of fine‑tuning remains a significant barrier to deployment. While Low‑Rank Adaptation (LoRA) has become the standard for parameter‑efficient fine‑tuning, the need to set a predefined, static rank r requires exhaustive grid searches to balance efficiency and performance. Existing rank‑adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub‑optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data‑inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka‑inspired training framework for LoRA that learns accurate hierarchical low‑rank representations by inserting a fixed, carefully crafted diagonal matrix P between the existing LoRA adapters to scale their sub‑ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing P and ensures all sub‑ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low‑rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low‑rank representations than prior rank‑adaptive approaches and achieves superior accuracy‑performance trade‑offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST‑DASLab/MatryoshkaLoRA.

Authors:Naoto Iwase, Yuki Ichihara, Mohammad Atif Quamar, Junpei Komiyama
Title: Reliable Chain-of-Thought via Prefix Consistency
Abstract:
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain‑of‑Thought (CoT) traces and aggregating them with majority voting (MV), a test‑time technique called self‑consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log‑probabilities or self‑rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto‑iwase/prefix‑consistency.

Authors:Christopher Ries, Moussa Kassem Sbeyti, Nicolas Bianco, Nadja Klein
Title: Probabilistic Object Detection with Conformal Prediction
Abstract:
Conformal Prediction (CP) is a distribution‑free method for constructing prediction sets with marginal finite‑sample coverage guarantees, making it a suitable framework for reliable uncertainty quantification in safety‑critical object detection. However, object detection introduces structured multi‑output predictions, complicating the application of classical CP theory developed for single outputs. In addition, standard, unscaled CP produces fixed‑width prediction intervals across inputs, leading to unnecessary width for low‑uncertainty predictions. While scaled CP addresses this by adapting the interval width to an input‑dependent uncertainty estimate, prior work has neither systematically compared unscaled and scaled CP for multi‑class object detection, nor integrated CP with a complementary uncertainty quantification method in this setting. We fill this gap by: (i) applying CP coordinate‑wise to bounding box corners with a Bonferroni correction for box‑level guarantees; (ii) scaling the resulting intervals using per‑prediction aleatoric uncertainty estimates derived from a probabilistic object detector trained with loss attenuation, evaluated in uncalibrated and two calibrated variants; (iii) extending to a two‑step pipeline that constructs prediction sets for the class using RAPS and conditions the conformalized bounding boxes on the predicted class set. Across three autonomous driving datasets (KITTI, BDD, CODA), including a cross‑domain setting under distribution shift, scaled CP consistently improves interval sharpness over unscaled CP, achieving up to 19% higher IoU and 39% lower interval scores, without sacrificing coverage. Class‑wise calibration further improves coverage for both variants with a negligible effect on sharpness. Together, these improvements yield more actionable uncertainty estimates for real‑time, real‑world object detection.

Authors:Fengqiang Wan, Yipeng Lin, Kan Lv, Yang Yang
Title: SR$^2$-LoRA: Self-Rectifying Inter-layer Relations in Low-Rank Adaptation for Class-Incremental Learning
Abstract:
Pre‑trained models with parameter‑efficient fine‑tuning (PEFT) have demonstrated promising potential for class‑incremental learning (CIL), yet catastrophic forgetting still persists when adapting models to new tasks. In this paper, we present a novel perspective on catastrophic forgetting through the analysis of inter‑layer relation drift, i.e., the progressive disruption of relationships among layer‑wise representations during the learning of new tasks. We theoretically show that the increase of such drift reduces the classification margins of previously learned tasks, thereby degrading overall model performance. To address this issue, we propose \underlineSelf‑\underlineRectifying inter‑layer \underlineRelation Low‑Rank Adaptation~(SR^2‑LoRA), a simple yet effective method that mitigates catastrophic forgetting by constraining inter‑layer relation drift. Specifically, SR^2‑LoRA constructs the relation matrices induced by the previous and current models on current‑task samples, and aligns the corresponding singular values. We further theoretically show that this alignment exhibits greater robustness to estimation perturbations than direct entry‑wise alignment. Extensive experiments on standard CIL benchmarks demonstrate that SR^2‑LoRA effectively mitigates catastrophic forgetting, with its advantages becoming more pronounced as the number of tasks increases. Code is available in the \hrefhttps://github.com/FqWan24/SR‑2‑LoRArepository.

Authors:Junfeng Fang, Zhepei Hong, Mao Zheng, Mingyang Song, Gengsheng Li, Houcheng Jiang, Dan Zhang, Haiyun Guo, Xiang Wang, Tat-Seng Chua
Title: Rubric-based On-policy Distillation
Abstract:
On‑policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white‑box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher‑generated responses. To prove it, we introduce ROPD, a simple yet foundational framework for rubric‑based OPD. Specifically, ROPD induces prompt‑specific rubrics from teacher‑student contrasts, and then utilizes these rubrics to score the student rollouts for on‑policy optimization. Empirically, ROPD outperforms the advanced logit‑based OPD methods across most scenarios, and achieving up to a 10x gain in sample efficiency. These results position rubric‑based OPD as a flexible, black‑box‑compatible alternative to the prevailing logit‑based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open‑source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.

Authors:Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei
Title: MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
Abstract:
DeepSeek Sparse Attention (DSA) sets the state of the art for fine‑grained inference‑time sparse attention by introducing a learned token‑wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek‑V3.2) that share the same selected token set; this multi‑head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop‑in replacement for the DSA indexer that treats its indexer heads as a pool of mixture‑of‑experts. A lightweight router uses cheap block‑level statistics to pick a query‑dependent subset of only a few active heads, and only those heads run the heavy token‑level scoring. This preserves the diversity of the original indexer pool while reducing the per‑query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re‑ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek‑V3.2 and GLM‑5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle‑in‑a‑Haystack heatmaps up to a 128K‑token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.

Authors:Yuheng Zhang, Chenlu Ye, Shuowei Jin, Changlong Yu, Wei Xiong, Saurabh Sahu, Nan Jiang
Title: Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
Abstract:
Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post‑training. Central to these approaches is the design of the importance sampling (IS) ratio used in off‑policy policy‑gradient estimation. Existing methods face a fundamental bias‑variance dilemma: token‑level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory‑level correction but suffer from high variance due to the multiplicative accumulation of per‑token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full‑sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per‑token ratios up to position t, as a theoretically principled solution to this dilemma. We prove that, under the token‑level policy‑gradient formulation, this ratio provides an unbiased prefix correction for each token‑level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position‑adaptive clipping that scales log‑space clip bounds according to the natural \sqrtt growth of the cumulative log‑ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool‑integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at https://github.com/horizon‑llm/CTPO.

Authors:Lucas Hu, Ranchi Zhao, Isaac Zhu, Zach Zhang, Hscos Zhang, Hugh Yin, Jason Zhao
Title: SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
Abstract:
In large‑scale reinforcement learning (RL) systems with decoupled Trainer‑Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter‑node bandwidth is abundant, such synchronization is usually only a small fraction of end‑to‑end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth‑constrained or network‑variable deployments ‑‑ for example, cross‑datacenter or cross‑cluster settings, heterogeneous resource pools, and online RL ‑‑ weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large‑model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL‑Sync, which replaces full‑weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per‑update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL‑Sync also reduces launch and control‑plane overhead, significantly improving scalability and end‑to‑end efficiency in bandwidth‑limited and highly asynchronous RL settings.

Authors:Sum Kyun Song, Bong Gyun Shin, Jae Yong Lee
Title: Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation
Abstract:
Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real‑world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM‑based qualitative and quantitative evaluation. DoLQ employs a multi‑agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi‑dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://github.com/Bon99yun/DoLQ.

Authors:Peter Pao-Huang, Xiaojie Qiu, Stefano Ermon
Title: Generative Modeling with Flux Matching
Abstract:
We introduce Flux Matching, a new paradigm for generative modeling that generalizes existing score‑based models to a broader family of vector fields that need not be conservative. Rather than requiring the model to equal the data score, the Flux Matching objective imposes a weaker condition that admits infinitely many vector fields whose stationary distribution is the data. This flexibility enables a class of generative models that cannot be learned under score matching, in which inductive biases, structural priors, and properties of the dynamics can be directly imposed or optimized. We show that Flux Matching performs strongly on high‑dimensional image datasets and, more importantly, that our added freedom unlocks a range of applications including faster sampling, interpretable and mechanistic models, and dynamics that encode directed dependencies between variables. More broadly, Flux Matching opens a new dimension in generative modeling by turning the vector field itself into a design choice rather than a fixed target. Code is available at https://github.com/peterpaohuang/flux_matching.

Authors:Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Predictive but Not Plannable: RC-aux for Latent World Models
Abstract:
A latent world model may achieve accurate short‑horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long‑horizon goal‑directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability‑Correction auxiliary objective (RC‑aux), a lightweight correction for this mismatch in reconstruction‑free latent world models. RC‑aux keeps the world‑model backbone unchanged and adds planning‑aligned supervision along two axes. Along the time axis, multi‑horizon open‑loop prediction trains the model beyond one‑step consistency. Along the space axis, budget‑conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability‑aware planner to favor trajectories that are both goal‑directed and attainable under the available budget. We instantiate RC‑aux on LeWorldModel and evaluate it under both continuation‑training and matched‑from‑scratch settings. Across goal‑conditioned pixel‑control tasks and a LIBERO‑Goal extension, RC‑aux improves LeWM‑style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC‑aux.

Authors:Hao Chen, Zavareh Bozorgasl
Title: Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning
Abstract:
Over‑the‑air federated learning (OTA‑FL) reduces uplink latency by exploiting waveform superposition, but conventional analog aggregation schemes typically require instantaneous channel state information (CSI), channel inversion, and coherent phase alignment, which can be difficult to maintain in practical wireless systems. This paper proposes resource‑element energy difference (REED), a noncoherent aggregation primitive for continuous signed updates that avoids instantaneous CSI. REED maps the positive and negative parts of each real‑valued update to transmit energies on two orthogonal resource elements with independent phase dithers, and the server estimates the signed aggregate from their energy difference. With only slow‑timescale calibration of average channel powers, REED is unbiased for the desired signed sum and admits an exact closed‑form variance under Rayleigh fading. We incorporate REED into full‑participation FedAvg and prove a smooth nonconvex stationarity bound. Under an average per‑client energy budget, the aggregation gain can be scheduled so that the REED‑induced perturbation scales quadratically with the local stepsize, yielding the canonical (1/sqrt(T)) stationarity rate. Experiments on MNIST and Fashion‑MNIST demonstrate that REED closely matches clean FedAvg and coherent CSIT aggregation in IID settings, while maintaining stable convergence with a moderate performance degradation under strong data heterogeneity.

Authors:Takato Honda
Title: Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition
Abstract:
How few parameters do we really need to forecast a periodic time series? An hourly electricity series, reshaped as a 24‑row matrix with one column per day, is approximately rank‑1: a daily shape modulated by a daily level (median centered rank‑1 energy 0.82 on GIFT‑Eval). Should we learn the shape? Smoothing, shrinkage, and low‑rank fits all seem like obvious upgrades over the simple average of the last K=2 cycles. On all 97 GIFT‑Eval configurations, we tested 8 such alternatives (e.g., Fourier, EWMA, James‑Stein, rank‑r SVD): none significantly beats the frozen baseline under Holm correction; two are significantly worse. The resulting method, FLAIR, is (a) Effective: matches PatchTST on aggregate GIFT‑Eval (relMASE 0.838 vs 0.849); (b) Compact: 28 scalars for hourly, 57 for weekly; (c) Fast: 22 minutes on one CPU core of a MacBook Pro; (d) Closed‑form & Hands‑Off: one SVD per period candidate, GCV‑averaged Ridge, no GPU, no pre‑training, no per‑task tuning. In the high‑rank‑1, many‑cycle regime, extra flexibility is estimation noise.

Authors:Akshita Singh, Prabesh Paudel, Siddhartha Roy
Title: Hallucination Detection via Activations of Open-Weight Proxy Analyzers
Abstract:
We introduce a proxy‑analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already‑generated text through a small locally hosted open‑weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT‑4 as when it is any open‑weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per‑head source‑document attention, entropy, MLP activations, logit‑lens trajectories, and three new token‑level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma‑2 at 2B and 9B, Pythia at 1.4B, and LLaMA‑3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP's token‑level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5‑7B reached an F1 of 0.717, just above ReDeEP's 0.713, while Qwen2.5‑0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen‑fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM‑AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.

Authors:Fred Zhangzhi Peng, Avishek Joey Bose, Anru R. Zhang, Alexander Tong
Title: Coupling Models for One-Step Discrete Generation
Abstract:
Generative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models(Coupling Models), a one‑step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi‑step sampler into a few steps, Coupling Model trains a purpose‑built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand‑specified data‑to‑noise couplings. Empirically,Coupling Model improves the strongest one‑step baselines in each domain: it reduces LM1B text‑generation perplexity by 33% at its lowest‑perplexity operating point, Fly Brain enhancer‑design FBD by 18%, and MNIST‑Binary FID by 46%. These results suggest that effective one‑step discrete generation depends strongly on how data and noise are coupled before decoding. Code is available at https://github.com/pengzhangzhi/Coupling‑Models.

Authors:Tianle Jiang, Yufa Zhou
Title: Simple KNN-Based Outlier Detection Achieves Robust Clustering
Abstract:
Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the robust k‑Means problem (i.e., k‑Means with outliers), the goal is to remove z outliers and minimize the k‑Means cost on the remaining points. Despite the close connection between robust k‑Means and outlier detection, both theoretical and empirical understanding of the effectiveness of classic outlier detection heuristics for robust k‑Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large K‑Nearest‑Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant‑factor reduction from robust k‑Means to standard k‑Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real‑world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN‑based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.

Authors:Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Title: AdaTKG: Adaptive Memory for Temporal Knowledge Graph Reasoning
Abstract:
Temporal knowledge graphs (TKGs) represent time‑stamped relational facts and support a wide range of reasoning tasks over evolving events. However, existing methods produce entity representations that are static at the entity level, in that each representation is a function of learned parameters only and retains no trace of the interactions in which the entity has participated. In this paper, we depart from this static view and propose that each entity be modeled as an adaptive process whose representation is refined every time the entity participates in a fact. To this end, we propose AdaTKG, which maintains a per‑entity memory that is updated with every observed interaction, with the memory accumulating online and predictions improving as more interactions arrive. Specifically, we instantiate the memory update as a learnable exponential moving average governed by a single shared scalar instead of using learnable parameters for each entity, enabling AdaTKG to handle entities unseen during training. Extensive experiments confirm consistent gains over TKG baselines, demonstrating the effectiveness of adaptive memory. Code is publicly available at: https://github.com/seunghan96/AdaTKG.

Authors:Mohamed Elrefaie, Dule Shu, Matthew Klenk, Faez Ahmed
Title: CarCrashNet: A Large-Scale Dataset and Hierarchical Neural Solver for Data-Driven Structural Crash Simulation
Abstract:
Crash simulation is a cornerstone of modern vehicle development because it reduces the need for costly physical prototypes, accelerates safety‑driven design iteration, and increasingly supports virtual testing workflows. At the same time, modeling structural crash mechanics remains exceptionally challenging: the response is governed by nonlinear contact, large deformation, material plasticity, failure, and complex multi‑body interactions evolving over space and time on high‑resolution finite‑element meshes. In this work, we introduce \textscCarCrashNet, a public high‑fidelity open‑source benchmark for data‑driven structural crash simulation. \textscCarCrashNet combines component‑scale and full‑vehicle simulations in a multi‑modal format, including more than 14,000 bumper‑beam pole‑impact simulations with varying geometry, materials, and boundary conditions, together with 825 full‑vehicle crash simulations built from three industry‑standard vehicle models of increasing structural complexity: Dodge Neon, Toyota Yaris, and Chevrolet Silverado. To establish the reliability of the benchmark, we validate our open‑source finite‑element workflow based on OpenRadioss against both experimental crash data and the commercial solver Ansys LS‑DYNA. We also introduce \textscCrashSolver, a machine‑learning model designed for full‑vehicle crash prediction from high‑resolution finite‑element crash data. We further perform extensive benchmarking across the released datasets and evaluate \textscCrashSolver against state‑of‑the‑art geometric deep learning and transformer‑based neural solvers. Our results position \textscCarCrashNet as a foundation for reproducible research in structural simulation, crashworthiness modeling, and AI‑driven virtual crash testing. The dataset is available at https://github.com/Mohamedelrefaie/CarCrashNet.

Authors:Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias
Title: Learning Visual Feature-Based World Models via Residual Latent Action
Abstract:
World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature‑based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature‑based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high‑dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as Residual Latent Action (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose RLA World Model (RLA‑WM), which predicts RLA values via flow matching. RLA‑WM outperforms both state‑of‑the‑art feature‑based and video‑diffusion world models on simulation and real‑world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA‑WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video‑aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla‑wm

Authors:Andy Dong, Ayfer Özgür
Title: Less Random, More Private: What is the Optimal Subsampling Scheme for DP-SGD?
Abstract:
Poisson subsampling is the default sampling scheme in differentially private machine learning, largely because its unstructured randomness yields tractable privacy amplification analyses. Yet this same randomness introduces substantial participation variance: each sample appears in very different numbers of training iterations. In this work, we show that this variance is not merely a practical artifact to be tolerated, but a fundamental source of suboptimal privacy amplification. We prove that Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in exactly a fixed number of iterations, achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum (σ\to 0 and σ\to \infty). Our analysis reveals that the privacy‑noise tradeoff is governed not by maximizing randomness, but by eliminating participation variance while preserving uniform marginal participation across iterations. To translate this asymptotic theory into finite‑noise guarantees, we introduce a practical near‑exact Monte Carlo accountant for BIS, which removes the analytical slack of existing RDP and composition‑based PLD analyses. Evaluations across more than 60 practical DP‑SGD configurations show that BIS consistently outperforms Poisson subsampling in the low‑noise regimes most relevant for high‑utility private training, reducing the required noise multiplier by up to 9.6%. These results overturn the common intuition that more sampling randomness necessarily yields stronger privacy amplification: in DP‑SGD, structured participation can be both more practical and more private. Our implementation is available at https://github.com/dong‑xin‑ao‑andy/bis‑mc‑accountant.

Authors:Guyue Luo, Qiao Liu
Title: BGM-IV: an AI-powered Bayesian generative modeling approach for instrumental variable analysis
Abstract:
Instrumental‑variable (IV) regression enables causal estimation under endogeneity, but modern IV problems often involve nonlinear structural effects and high‑dimensional covariates. Existing nonlinear IV methods directly learn the causal relation in observed feature space or rely on learned representations within two‑stage or moment‑based procedures, which can struggle when the causal information is embedded in a high‑dimensional representation. We propose BGM‑IV, a latent Bayesian generative modeling approach that reframes nonlinear IV regression as posterior inference in a causally structured latent space. BGM‑IV infers latent components that separately capture shared confounding structure, outcome‑specific variation, treatment‑specific variation, and covariate‑only nuisance information. To account for endogeneity, BGM‑IV replaces the confounded outcome likelihood with an IV‑integrated pseudo‑likelihood that averages over instrument‑induced treatment values within the latent model. Across various benchmark datasets, BGM‑IV remains competitive in the classical low‑dimensional regime and performs best in high‑dimensional covariate regimes. Together, these results show that structured latent generative modeling provides a principled and effective strategy to nonlinear IV estimation with rich covariates. The code of BGM‑IV is available at https://github.com/liuq‑lab/BGM‑IV.

Authors:Haydn Jones, Yimeng Zeng, Alden Rose, Li S. Yifei, Yining Huang, Kaiwen Wu, Jiaming Liang, Maggie Ziyu Huan, Yoseph Barash, Cesar de la Fuente-Nunez, Osbert Bastani, Zachary Ives, Mark Yatskar, Jacob R. Gardner
Title: Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Abstract:
Manually curated biomedical repositories ‑‑ spanning bioactivity, genomics, and chemistry ‑‑ are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost‑effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM‑based entity‑tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M‑paper, 2.5T‑token PubMed corpus; (2) hybrid sparse‑dense retrieval supporting entity‑filtered semantic queries over the tagged corpus; and (3) Starling, a multi‑agent deep research system that, given only a natural‑language task description, designs precision‑ and recall‑targeted retrieval filters, induces an extraction schema, and emits structured records with nuance‑rich fields and supporting passages. Across six tasks ‑‑ blood‑brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene‑disease associations, protein subcellular localization, and chemical reactions ‑‑ Starling produces ~6.3M records (91K‑3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier‑model rejection of our extractions is 0.6‑7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard ‑‑ e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI‑driven therapeutic design. Code and datasets: https://github.com/starling‑labs/starling.

Authors:Chenhui Xu, Ziyue Bai, Fuxun Yu, Heng Huang, Jinjun Xiong
Title: Rollback-Free Stable Brick Structures Generation
Abstract:
While autoregressive models have advanced 3D generation, creating physically stable brick structures remains a challenge due to the strict requirements of gravity and interconnectivity. Existing approaches rely on external physical simulators during inference to perform rejection sampling and brick‑by‑brick rollbacks, which severely bottlenecks efficiency. To address this, we propose a reinforcement learning paradigm that shifts physical validity enforcement from test‑time correction to training‑time policy optimization. By utilizing assembly‑level rewards, the model optimizes for collision avoidance, global connectivity, structural interlocking, and shape conformity. This paradigm allows the model to internalize physical priors, enabling the first rollback‑free generation of stable brick structures. Experimental results demonstrate that our approach achieves state‑of‑the‑art generation quality while accelerating inference speed by orders of magnitude. Our code and dataset are available at https://github.com/miniHuiHui/STABLE. Our models are available at https://huggingface.co/miniHui/STABLE.

Authors:Fred Zhangzhi Peng, Alexis Fox, Anru R. Zhang, Alexander Tong
Title: Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment
Abstract:
Diffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non‑sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion language models, existing recipes primarily transfer parameters through continued denoising training with objective‑ and attention‑level modifications. We instead ask whether the internal representation geometry learned by next‑token prediction can be explicitly preserved during AR‑to‑DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR‑ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low‑data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion language models. Code is available at https://github.com/pengzhangzhi/Open‑dLLM.

Authors:Yuwei Yin, Chuyuan Li, Giuseppe Carenini
Title: IntentGrasp: A Comprehensive Benchmark for Intent Understanding
Abstract:
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high‑quality, open‑licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source datasets curation, intent label contextualization, and task format unification. IntentGrasp contains a large‑scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT‑5.4, Gemini‑3.1‑Pro, and Claude‑Opus‑4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem set. Notably, 17 out of 20 tested models perform worse than a random‑guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance such ability, this paper proposes Intentional Fine‑Tuning (IFT), which fine‑tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave‑one‑domain‑out (Lodo) experiments further demonstrate the strong cross‑domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefits and social good.

Authors:Naihe Feng, Yi Sui, Shiyi Hou, Ga Wu, Jesse C. Cresswell
Title: Conformal Agent Error Attribution
Abstract:
When multi‑agent systems (MAS) fail, identifying where the decisive error occurred is the first step for automated recovery to an earlier state. Error attribution remains a fundamental challenge due to the long interaction traces that large language model‑based MAS generate. This paper presents a framework for error attribution based on conformal prediction (CP) which provides finite‑sample, distribution‑free coverage guarantees. We introduce new algorithms for filtration‑based CP designed for sequential data such as agent trajectories. Unlike existing CP algorithms, our approach predicts sets that are contiguous sequences to enable efficient recovery and debugging. We verify our theoretical guarantees on a variety of agents and datasets, show that errors can be precisely isolated, then use prediction sets to rollback MAS to correct their own errors. Our overall approach is model‑agnostic, and offers a principled uncertainty layer for MAS error attribution. We release code at https://github.com/layer6ai‑labs/conformal‑agent‑error‑attribution.

Authors:Zhifeng Gu, Yuqi Wang, Bing Wang
Title: R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations
Abstract:
Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred relations are often unreliable and are typically handled with post‑hoc heuristics. In this paper, we propose R^3L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. Our key motivation is that multi‑hop reasoning requires repeated reference‑frame transformations, which accumulate errors in inferred relations and lead to semantic and metric drift. To mitigate this, we propose invariant spatial decomposition to break coupled relation chains, and consistent spatial imagination to promote self‑consistency through an imagine‑and‑revise loop. We further introduce supportive spatial optimization to ease pose optimization via global‑to‑local coordinate re‑parameterization. Extensive experiments across diverse scene types and instructions demonstrate that R^3L produces more physically feasible and semantically consistent layouts. Notably, our analysis shows that resolving frame‑induced inconsistencies is crucial for reliable multi‑hop relative spatial reasoning. The code is available at https://github.com/Neal2020GitHub/R3L.

Authors:Arash Shahmansoori
Title: The E$Δ$-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality
Abstract:
We present the EΔ‑MHC‑Geo Transformer, a novel architecture that unifies Manifold‑Constrained Hyper‑Connections (mHC), Deep Delta Learning (DDL), and the Cayley transform to obtain input‑adaptive, unconditionally orthogonal residual connections. Unlike DDL, whose Householder operator is orthogonal only at β\in \0,2\, our Data‑Dependent Cayley rotation Q(x)=(I+(β/2)A(x))^‑1(I‑(β/2)A(x)) preserves orthogonality for all β and all inputs. To handle negation, an eigenvalue ‑1 case that Cayley provably excludes, we introduce the EΔ‑MHC‑Geo Hybrid, which combines Cayley rotation with Householder reflection via a learned operator‑selection gate X'=γ(X)Q(X)X+(1‑γ(X))H_2(X)X. A midpoint‑collapse regularizer, 4γ(1‑γ), encourages boundary gate decisions, where each selected component is orthogonal. In matched‑parameter comparisons, with approximately 1.79M parameters per model and mean +/‑ standard deviation over 3 seeds, against four baselines including the concurrent JPmHC, EΔ‑MHC‑Geo achieves the best long‑horizon stability, 1.9x over JPmHC and 3.8x over GPT; the best near‑π rotation loss, 4.5x over JPmHC on single‑plane; strong norm preservation, with 0.001 mean deviation; and 0.96 negation cosine alignment in a diagnostic reflection probe, all with 33% fewer layers. While JPmHC's wider representation excels on pure rotation, its finite Cayley residual mixer excludes an exact λ=‑1 operator and has no reflection branch, motivating our hybrid approach for accessing both connected components of O(n).

Authors:Yubo Jiang, Yitong An, Xin Yang, Abudukelimu Wuerkaixi, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang
Title: Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
Abstract:
Vision‑Language Models (VLMs) are frequently undermined by object hallucination, generating content that contradicts visual reality, due to an over‑reliance on linguistic priors. We introduce Positive‑and‑Negative Decoding (PND), a training‑free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our finding of an attention imbalance in VLMs, where visual features are under‑weighted. Our framework introduces a dual‑path contrast: a positive path that amplifies visual evidence and a negative path that constructs counterfactuals to penalize prior‑dominant generation. By contrasting outputs from both paths during decoding, PND steers generation toward visually grounded results. Experiments on POPE, MME, and CHAIR demonstrate state‑of‑the‑art performance without retraining.

Authors:Jon-Paul Cacioli
Title: Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas
Abstract:
Aggregate metacognitive quality scores mask within‑model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six‑domain grouping) to 33 frontier LLMs from eight model families and computed Type‑2 AUROC per model‑domain cell using verbalized confidence (0‑100). Total observations: 47,151. Every model with above‑chance aggregate monitoring showed non‑trivial domain‑level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top‑2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom‑2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject‑level coherence analysis (within‑domain similarity ratio = 0.95) confirms the six‑domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within‑family profile‑shape clustering is significant for Anthropic, Google‑Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google‑Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe‑format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split‑half aggregate stability r = .893; profile‑level split‑half is weaker (grand median r = .184). These results show stable benchmark‑domain variation obscured by aggregate metrics, and support benchmark‑stage domain screening as a step before deployment in specific application areas.

Authors:Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang, Zhenxi Song, Min Zhang
Title: MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
Abstract:
Large language model (LLM)‑based Multi‑agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role‑specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non‑trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground‑truth labels. Furthermore, MASPO employs a data‑driven evolutionary beam search to efficiently navigate the high‑dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state‑of‑the‑art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://github.com/wangzx1219/MASPO.

Authors:Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek
Title: SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
Abstract:
Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human‑understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real‑world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top‑K selection mechanism. Our method uses a differentiable Soft Top‑K operator to learn an input‑dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: https://github.com/St0pien/SoftSAE.

Authors:Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, Srijan Kumar
Title: UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
Abstract:
Self‑distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self‑generated trajectories are free‑form, correctness is task‑dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self‑distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi‑teacher agreement, EMA teacher stabilization, token‑level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self‑distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self‑distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.

Authors:Magnus Victor Boock, Abdullah Akgül, Mustafa Mert Çelikok, Melih Kandemir
Title: Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies
Abstract:
We present a new operator‑theoretic representation learning framework for offline reinforcement learning that recovers the directed temporal geometry of a controlled Markov process from hitting time observations. While prior art often produces symmetric distances or fails to satisfy the triangle inequality, our framework learns a Hilbert‑space displacement geometry where expected hitting times are realized as linear functionals of latent displacements. We prove that this representation exists under latent linear closure and is uniquely identifiable up to a bounded linear isomorphism. For finite‑dimensional implementations, we show that global hitting‑time error is bounded by one‑step transition error amplified by the environment's transient spectral radius. Furthermore, we provide finite‑sample guarantees accounting for approximation, statistical complexity, and trajectory‑label mismatch. Derived from this theory, we curate Isomorphic Embedding Learning (IEL) as a new goal‑agnostic foundation policy learning algorithm that anchors a HILP‑style consistency objective with explicit hitting‑time regression to ensure that the learned geometry reflects actual decision‑time progress. This asymmetric and compositional structure enables robust graph‑based multi‑stage planning for long‑horizon navigation. Our experiments demonstrate that IEL improves the state of the art of learning foundation policy policies from offline maze locomotion data. Our code can be found on https://github.com/MagnusBoock/IEL

Authors:Davide Rindori
Title: Neural-Actuarial Longevity Forecasting: Anchoring LSTMs for Explainable Risk Management
Abstract:
Traditional multi‑population models, such as the Li‑Lee framework, rely on the assumption of mean‑reverting country‑specific deviations. However, recent data from high‑longevity clusters suggest a systemic break in this paradigm. We identify a stationarity paradox where mortality residuals in countries like Sweden and West Germany exhibit persistent unit roots, leading to a systematic mispricing of longevity risk in linear models. To address these non‑linearities, we propose Hybrid‑Lift, a neural‑actuarial framework that combines Hierarchical LSTM networks with a Mean‑Bias Correction (MBC) anchoring mechanism. Positioned as a governance‑friendly model challenger rather than a replacement of classical approaches, the framework exhibits selective superiority on out‑of‑sample validation (2012‑2020): it outperforms Li‑Lee by 17.40% in Sweden and 12.57% in West Germany, while remaining comparable for near‑linear regimes such as Switzerland and Japan. We complement the predictive model with an integrated governance suite comprising SHAP‑based cross‑country influence mapping, a dual uncertainty framework for regulatory capital calibration (Swiss ES 99.0% of +1.153 years), and a reverse stress test identifying the critical shock threshold for solvency buffer exhaustion. This research provides evidence that neural networks, when properly anchored by actuarial principles, can serve as effective model challengers for longevity risk management under the SST and Solvency II standards.

Authors:Shouvik Sardar, Sourish Das
Title: TinyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices
Abstract:
Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource‑constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge‑deployable plant disease systems rely on end‑to‑end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware‑level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed‑form Bayesian classifier with a mobile‑grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8‑Nano (5.9 MB) for lesion localisation, MobileNetV3‑Small (3.5 MB) for feature extraction, and the Jacobi prior; a Bayesian method that provides a closed form non‑iterative estimators via projection, for the classification. The Jacobi‑DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end‑to‑end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi‑GP, and demonstrate that the Jacobi‑DMR offers the best trade‑off between accuracy, model size, and inference speed for edge deployment. We have proved the asymptotic equivalence and consistency, asymptotic normality and the bias correction of Jacobi‑DMR. All data and codes are available here: https://github.com/shouvik‑sardar/TinyBayes

Authors:Shaofeng Qin, Li Wang
Title: LINC: Decoupling Local Consequence Scoring from Hidden Matching in Constructive Neural Routing
Abstract:
Constructive neural routing solvers usually score the next action by matching a decoder context to candidate embeddings, hiding deterministic one‑step consequences such as travel, waiting, slack, and capacity changes. We propose LINC (Local Inference via Normed Comparison), a decoder‑side candidate decision architecture that computes these consequences explicitly. LINC uses them according to their decision role: centered relative consequences are compared by a shared linear local scorer, while feasible‑set summaries modulate the decoder context. This preserves standard global matching and relieves the hidden state from rediscovering transition arithmetic. The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) serves as the main constrained‑routing stress test; the same interface extends to the Capacitated Vehicle Routing Problem (CVRP) and Traveling Salesman Problem (TSP). In particular, for CVRPTW, LINC reduces PolyNet's Solomon/Homberger gaps from 13.83%/38.15% to 7.26%/14.71%; for TSP and CVRP, it also improves external‑benchmark gaps.

Authors:Guanmeng Xian, Ning Yang, Philip S. Yu
Title: Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks
Abstract:
Multimodal recommender systems exploit visual and textual signals to alleviate data sparsity, but this also makes them more vulnerable to evasion‑based promotion attacks. Existing defenses are largely limited to single‑modal settings and mainly focus on poisoning‑based threats, leaving evasion‑based threats underexplored. In this work, we first identify a cross‑modal gradient mismatch under the multi‑user promotion setting, where visual and textual perturbations are optimized in inconsistent directions due to the dominance of distinct user groups. This phenomenon dilutes the attack effectiveness and leads robust training to underestimate worst‑case risks. To address this issue, we propose Untargeted Adversarial Training with Multimodal Coordination (UAT‑MC). UAT‑MC tackles the challenge of unknown targeted items in evasion‑based attacks (as opposed to poisoning‑based attacks) by treating all items as potential targets, and introduces a gradient alignment mechanism to explicitly correct this mismatch. This design ensures synchronized perturbations across modalities, thereby maximizing adversarial strength for robust training. Extensive experiments demonstrate that UAT‑MC significantly improves robustness against promotion attacks while maintaining acceptable recommendation performance under the defense‑accuracy trade‑off. Code is available at https://github.com/gmXian/UAT‑MC.

Authors:Jun Li, Peifeng Lai, Xuhang Lou, Jinpeng Wang, Yuting Wang, Ke Chen, Yaowei Wang, Shu-Tao Xia
Title: Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
Abstract:
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi‑granular cross‑modal evidence to quantify and model uncertainty explicitly. At the inter‑video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three‑fold principle, we perform fine‑grained query identification, which then guides query‑adaptive calibrated learning. At the intra‑video level, to accumulate denser evidence, we formulate a soft query‑clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state‑of‑the‑art methods. Code is released at https://github.com/lijun2005/ICML26‑Holmes.

Authors:Rappy Saha, Jude Haris, Nicolas Bohm Agostini, David Kaeli, José Cano
Title: PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs
Abstract:
Power‑of‑two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit‑shift operations for inference. Prior work has shown that PoT‑quantized DNNs can preserve accuracy for tasks such as image classification; however, their performance on resource‑constrained edge devices remains insufficiently understood. While general‑purpose edge CPUs and GPUs do not provide optimized backends for bit‑shift operations, custom hardware accelerators can better exploit PoT quantization by implementing dedicated shift‑based processing elements. However, deploying PoT‑quantized models on such accelerators is challenging due to limited support in existing inference frameworks. In addition, the impact of different PoT quantization strategies on hardware design, performance, and energy efficiency during full inference has not been systematically explored. To address these challenges, we propose PoTAcc, an open‑source end‑to‑end pipeline for accelerating and evaluating PoT‑quantized DNNs on resource‑constrained edge devices. PoTAcc enables seamless preparation and deployment of PoT‑quantized models via TensorFlow Lite (TFLite) across heterogeneous platforms, including CPU‑only systems and hybrid CPU‑FPGA systems with custom accelerators. We design shift‑based processing element (shift‑PE) accelerators for three PoT quantization methods and implement them on two FPGA platforms. We evaluate accuracy, performance, energy efficiency, and resource utilization across a range of models, including CNNs and Transformer‑based architectures. Results show that our CPU‑accelerator design achieves up to 3.6x speedup and 78% energy reduction compared to CPU‑only execution for PoT‑quantized DNNs on PYNQ‑Z2 and Kria boards. The code will be publicly released at https://github.com/gicLAB/PoTAcc

Authors:Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry, Boris Ginsburg
Title: Normalized Architectures are Natively 4-Bit
Abstract:
Training large language models at 4‑bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low‑precision arithmetic. This removes the need for interventions‑such as applying random Hadamard transforms and performing per‑tensor scaling calculations‑to preserve model quality, and it enables stable end‑to‑end NVFP4 training. We validate this approach on both a 1.2B dense model and hybrid (Mamba‑Transformer) MoE models of up to 3B/30B parameters. We trace this robustness to the dot product: while quantization noise remains largely uncorrelated in both standard and normalized architectures, the signal behaves differently. In nGPT, the hypersphere constraint enhances weak positive correlations among the element‑wise products, leading to a constructive accumulation of the signal across the hidden dimension while the noise continues to average out. This yields a higher effective signal‑to‑noise ratio and a flatter loss landscape, with the effect strengthening as the hidden dimension grows, suggesting increasing advantages at scale. A reference implementation is available at https://github.com/anonymous452026/ngpt‑nvfp4

Authors:Hugo Cazaux, Eyjólfur Ingi Ásgeirsson, Hlynur Stefánsson
Title: Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
Abstract:
Synthetic data has transformed language model training, yet its role in time series forecasting remains poorly understood. We present a large‑scale empirical study: nine experiment groups, 4,218 runs systematically evaluating synthetic time series augmentation across five architectures, four synthetic signals and seven datasets. The effect is sharply architecture‑conditional: channel‑mixing models (TimesNet, iTransformer) benefit in the majority of trials, while channel‑independent models (DLinear, PatchTST) are consistently degraded. In selected low‑resource settings the gains are striking: TimesNet trained on only 10% of Weather data with synthetic augmentation surpasses the full‑data baseline (4 of 16 sparsity‑dataset combinations). Averaged across all architectures, augmentation hurts in 67% of trials. We further find that only the Seasonal‑Trend generator reliably helps across the tested benchmarks, and that hard curriculum switching is actively harmful (+24% MSE degradation). These results provide concrete, actionable guidelines on how to use synthetic data: use synthetic augmentation with channel‑mixing architectures, use gradual annealing schedules, and treat low‑resource augmentation as architecture‑ and dataset‑dependent. Code is available at \hrefhttps://github.com/hugoiscracked/synthetic‑ts/tree/main

Authors:Geping Chen, Chunlin Li, Tianzhong Yang, Zhengyuan Zhu, Jing Zhou
Title: TabCF: Distributional Control Function Estimation with Tabular Foundation Models
Abstract:
Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. In this paper, we introduce a simple method, TabCF, for control function regression using tabular foundation models, which enables accurate, fast, identification‑transparent, and tuning‑light causal estimation of distributional quantities, such as interventional means and quantiles; we also propose a copula‑based approximation for multivariate outcomes. TabCF performs favorably against representative methods across a broad range of small‑ to medium‑sized synthetic and real data scenarios. The central message is two‑fold: for practitioners, it highlights that TabCF is an effective tool for distributional causal inference; for researchers, it suggests that the proposed approach could be considered a strong baseline for future method development. Code is available at https://github.com/GepingChen/TabCF.

Authors:Sankarshana Venugopal, Mohammad Mostafavi, Jonghyun Choi
Title: DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
Abstract:
Diffusion‑based image‑to‑image (I2I) translation excels in high‑fidelity generation but suffers from slow sampling in state‑of‑the‑art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training‑free sampler that exploits the semi‑linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly‑efficient 1st‑ and 2nd‑order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd‑order baseline). Experiments on inpainting, stylization, and semantics‑to‑image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency‑quality tradeoffs, enabling real‑world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.

Authors:Hanyu Gao, Bin Cao, Yunyue Su, Tong-Yi Zhang, Qiang Liu
Title: XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction
Abstract:
Multiphase powder X‑ray diffraction (PXRD) analysis remains a fundamental bottleneck in structure identification, as real‑world synthesis often produces complex mixtures whose constituent phases (components) cannot be reliably disentangled. While recent advances in representation‑based crystal retrieval and generation suggest the possibility of inferring structures directly from PXRD, existing approaches largely assume single‑phase inputs and break down in multiphase settings. Here, we present XDecomposer, a prior‑free framework for joint decomposition and identification of multiphase XRD patterns without requiring candidate phase lists, structural templates, or prior knowledge of phase number. We formulate multiphase diffraction analysis as a set prediction problem, where the model infers an unordered set of phase‑resolved components, their mixture proportions, and corresponding structural representations within a unified architecture. A phase‑query‑driven decomposition mechanism, together with diffraction‑consistent physical reconstruction, enables accurate source separation while preserving crystallographic fidelity. Extensive experiments on both simulated and experimental datasets show that XDecomposer substantially improves reconstruction accuracy and phase identification across diverse chemical systems, while maintaining strong generalization to unseen mixtures. These results provide a practical route toward data‑driven, source‑resolved multiphase XRD analysis and reduce long‑standing dependence on prior‑guided iteratively phase matching. The code is openly available at https://github.com/Licht0812/XDecomposer

Authors:Yulong Huang, Xiang Liu, Hongxiang Huang, Xiaopeng Lin, Zunchang Liu, Xiaowen Chu, Zeke Xie, Bojun Cheng
Title: MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
Abstract:
Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self‑attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed‑form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence in optimization. While momentum‑based optimizers provide a natural remedy, they pose challenges in simultaneously achieving training efficiency and effectiveness. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical systems perspective, we analyze the momentum‑based recurrence as a second‑order system that introduces complex conjugate eigenvalues. This analysis guides the design of stable gating constraints. The resulting model, Momentum DeltaNet (MDN), leverages Triton kernels to achieve comparable training throughput with competitive linear models such as Mamba2 and KDA. Extensive experiments on the 400M and 1.3B parameter models demonstrate consistent performance improvements over strong baselines, including Transformers, Mamba2 and GDN, across diverse downstream evaluation benchmarks. Code: https://github.com/HuuYuLong/MomentumDeltaNet .

Authors:Zhiyuan Zhai, Xin Wang
Title: Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
Abstract:
Group‑relative RL training (GRPO) samples a small group of parallel rollouts for every training prompt and uses their within‑group reward spread to compute per‑trajectory advantages. In agentic environments each rollout is a long multi‑turn dialogue with one LLM call per step, so this multi‑sample multiplier dominates the total training cost. When every rollout of a prompt ends with the same reward, the group has zero reward variance and contributes no gradient, so the extra rollouts add no information; such groups are common in practice (typically around 40% of all groups), so the wasted‑compute fraction is substantial rather than marginal. Existing methods filter such groups at the prompt level, either after their rollouts are paid for or before any rollout begins, but both decide without using information that becomes available during the rollout itself. We instead ask whether the in‑group divergence between the partial trajectories at an intermediate step can already predict that the group will be zero‑variance: when the parallel rollouts have already converged on the same action prefix, the group is on track to produce a single reward, and we can stop early. We propose a one‑parameter gate that stops a group when the mean pairwise prefix edit distance between its partial action sequences falls below a threshold. On a 60‑iteration on‑policy GRPO run on ALFWorld with Qwen2.5‑7B, averaged over four random seeds, the gated arm finishes 10.7% faster in wall‑clock (bootstrap 95% CI excludes 0) and shifts held‑out success rate on 50 unseen tasks by +2.5 pp, with the held‑out gain tracing to a measurable reduction in zero‑advantage gradient‑batch dilution. Code is available at https://github.com/zhiyuanZhai20/selective‑rollout.

Authors:Mei Wu, Wenchao Weng, Wenxin Su, Wenjie Tang, Wei Zhou
Title: CoMemNet: Contrastive Sampling with Memory Replay Network for Continual Traffic Prediction
Abstract:
In recent years, the integration of non‑topological space modeling with temporal learning methods has emerged as an effective approach for capturing spatio‑temporal information in non‑Euclidean graphs. However, most existing methods rely on static underlying graph structures, which are inadequate for capturing the continuously expanding and evolving patterns in streaming traffic networks. To address this challenge, we propose a simple yet efficient dual‑branch continual learning framework for traffic prediction, named CoMemNet. The fast‑converging Online branch undertakes the primary prediction tasks, while the momentum‑updated Target branch extracts historical information using Wasserstein Distance features to create a Dynamic Contrastive Sampler (DC Sampler). This sampler selects a node set with significant dynamic network feature changes for training, effectively mitigating the issue of catastrophic forgetting. Additionally, the backbone incorporates a lightweight Node‑Adaptive Temporal Memory Buffer (TMRB‑N) to consolidate old knowledge through memory replay and address the risk of memory explosion. Finally, we provide two newly curated open‑source datasets. Experimental results demonstrate that CoMemNet achieves state‑of‑the‑art (SOTA) performance across all three large‑scale real‑world datasets. The code is available at: https://github.com/meiwu5/CoMemNet.

Authors:Anh H. Vo, Sungyo Lee, Phil-Joong Kim, Soo-Mi Choi, Yong-Guk Kim
Title: Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
Abstract:
Recent advances in large language models (LLMs) have significantly improved language‑driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language‑driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI‑in‑the‑loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state‑of‑the‑art performance in task‑based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed‑loop integration of generation and interaction for next‑generation multimedia systems. Our project page can be found at https://proj‑showcase.github.io/h3ds/.

Authors:Gabriel Jeanson, David-Alexandre Duclos, William Larrivée-Hardy, Noé Cochet, Matěj Boxan, Anthony Deschênes, François Pomerleau, Philippe Giguère
Title: Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping
Abstract:
Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour‑intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning‑based interpretation is bottlenecked by the severe scarcity of expert‑annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine‑grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo‑interpretation for high‑resolution, millimetre‑level aerial imagery. Importantly, we leverage the large‑scale vision‑language Nano Banana Pro model to simultaneously generate high‑fidelity images and their corresponding pixel‑aligned semantic masks from prompts. We introduce WilDReF‑Q‑V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real‑world data with AI‑generated images, highlighting that AI‑generated data is highly complementary to real‑world data, with unified training yielding an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt‑generated data significantly improve performance for underrepresented species, some of which saw per‑species F1 score gains of up to 30 %pt. We conclude that vision‑language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at https://norlab‑ulaval.github.io/gen4regen.

Authors:Lei Jiang, Adrian Ildefonso, Daniel Loveless, Fan Chen
Title: LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites
Abstract:
Large language models (LLMs) impose rapidly growing energy demands, creating an emerging energy and carbon crisis driven by large‑scale inference. Solar‑powered, AI‑enabled low Earth orbit (LEO) satellites have been proposed to mitigate terrestrial electricity consumption, but their lifecycle carbon footprint remains poorly understood due to launch emissions, satellite manufacturing, and radiation‑hardened hardware requirements. This paper presents LLMSpace, the first carbon modeling framework for LLM inference on AI‑enabled LEO satellites. LLMSpace jointly models operational and embodied carbon, peripheral subsystems, radiation‑hardened accelerators and memories, and LLM‑specific workload characteristics such as prefill‑decode behavior and token generation. Using realistic satellite and GPU configurations, LLMSpace reveals key trade‑offs among carbon footprint, inference latency, hardware design, and operational lifetime for sustainable space‑based LLM inference. Source code: https://github.com/UnchartedRLab/LLMSpace.

Authors:Shivam Kumar Panda, M Khalid Jawed
Title: Discrete Elastic Ribbons: A Unified Discrete Differential Geometry Framework for One-Dimensional Energy Models
Abstract:
Elastic ribbons, slender structures whose length (L), width (W), and thickness (b) satisfy L \gg W \gg b, exhibit mechanical behaviors intermediate between one‑dimensional rods (L \gg W, b) and two‑dimensional plates (L, W \gg b). In quadratic Kirchhoff‑type rod‑based frameworks, such as Discrete Elastic Rods (DER), the governing equilibrium equations are independent of width, and therefore these models cannot capture width‑dependent mechanical effects. Reduced centerline‑based ribbon models attempt to capture width dependence via coupled bending‑twisting energies. However, their relative accuracy remain unclear due to the absence of a unified simulation framework. In this work, we formulate a framework grounded in discrete differential geometry where the energy is expressed as functions of coupled bending‑twisting strain measures along the centerline, rather than a linear sum of quadratic bending and twisting energies in DER. We derive analytical gradients and Hessians of the energy that enable implicit time integration. Within this unified setting, we compare five ribbon models: Kirchhoff, Sadowsky, Wunderlich, Sano, and Audoly. As a benchmark, a straight ribbon is longitudinally constrained into a pre‑buckled arch and subjected to transverse displacement, inducing a supercritical pitchfork bifurcation. Predicted bifurcation thresholds are compared against shell‑based finite element simulations, with the Sano model providing the closest agreement in capturing width‑dependent shifts. Our high‑performance JAX‑based implementation achieves \mathcalO(N) per‑iteration cost and also confirms that Sano model introduces negligible per‑iteration overhead relative to standard DER.

Authors:Jae-Won Chung, Zhirui Liang, Yanyong Mao, Jiasi Chen, Mosharaf Chowdhury, Vladimir Dvorkin
Title: OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination
Abstract:
AI's growing compute demand and new datacenter buildouts present major capacity and reliability challenges for the electricity grid, leading to multi‑year interconnection delays for new datacenters and bottlenecking AI growth. To ease this strain, datacenters increasingly offer rapid power flexibility in response to grid signals, where the datacenter can increase or decrease its power consumption by adapting its workload in real time. In order to understand the impact of large datacenters on the grid and to facilitate the design of effective coordination strategies, we build OpenG2G, a simulation platform for AI datacenter‑grid runtime coordination. We show that OpenG2G is capable of answering a wide range of coordination questions by allowing users to implement and compare various control paradigms (including classic, optimization, and learning‑based controllers), and quantify how AI model and deployment choices affect datacenter flexibility and coordination outcomes. This versatility is enabled by OpenG2G's modular and extensible architecture: a datacenter backend driven by real measurements of production‑grade AI services, a grid backend built on high‑fidelity grid simulators, and a generic controller interface that closes the loop between them. We describe the design of OpenG2G and demonstrate its usefulness through realistic grid scenarios and AI workloads.

Authors:Taeyoung Kim, Joon-Hyuk Ko
Title: A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers
Abstract:
We propose an architecture that augments the Flux Neural Operator (Flux NO), which combines the classical finite volume method (FVM) with neural operators, with ViT‑based context injection. Our model is formulated as a hypernetwork: it extracts solution dynamics over a finite temporal window, encodes them with a recurrent Vision Transformer, and generates the parameters of a context‑conditioned neural operator. This enables the model to infer and solve conservation laws without explicit access to the governing equation or PDE coefficients. Experimentally, we show that the proposed method preserves the robustness, generalization ability, and long‑time prediction advantages of Flux NO over standard neural operators, while delivering reliable numerical solutions across a broad range of conservative systems, including previously unseen fluxes. Our code is available at https://github.com/xx257xx/CONTEXT_FLUX_NO.

Authors:Othmane Kabal, Mounira Harzallah, Fabrice Guillet, Hideaki Takeda, Ryutaro Ichise
Title: Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs
Abstract:
Graph Self‑Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP enable the large‑scale automatic extraction of knowledge graphs from text, opening new opportunities for GSSL while introducing substantial real‑world noise. This type of noise remains largely unexplored, as prior robustness studies typically rely on synthetic perturbations. To address this gap, we present the first comprehensive evaluation of GSSL methods on text‑driven graphs for unsupervised term typing. We introduce Noise‑Aware Text‑Driven Graph GSSL (NATD‑GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual‑graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variability in robustness across both pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well‑defined schemas, whereas feature reconstruction is considerably more robust, achieving performance comparable to clean‑graph settings. Contrastive objectives are generally less affected by noise but depend strongly on alignment with downstream tasks. GNN architecture also plays a critical role: bidirectional relational message‑passing designs are better suited to noisy, text‑driven graphs, while unidirectional relational ones perform best on clean graphs. Overall, NATD‑GSSL provides practical guidance for applying GSSL to real‑world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at https://github.com/OthmaneKabal/MC2GAE.

Authors:Daniel Grimmer
Title: Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles
Abstract:
Evolutionary computation has long promised to deliver both high‑performance optimization tools as well as rigorous scientific simulations of Darwinian evolution. However, modern algorithms frequently abandon evolutionary fidelity for physics‑inspired heuristics or superficial biological metaphors. This paper derives a suite of advanced gradient‑based optimization algorithms directly from evolutionary first principles. We introduce Darwinian Lineage Simulations (DLS) to prove that, in an asexual context, Fisher's and Wright's historically opposed views of evolution are actually formally equivalent. This unification requires carefully partitioning Fisher's deterministically‑evolving total population into Wright's randomly‑drifting sub‑populations. We prove that proper bookkeeping requires introducing a specific kind of structured noise (the DLS noise relation). Crucially, however, any bookkeeping choices which satisfy this relation will result in a faithful simulation of evolution. Using this vast representational freedom, we prove that a broad family of battle‑tested optimization algorithms are already perfectly compatible with evolutionary dynamics. These include: Stochastic Gradient Descent, Natural Gradient Descent, and the Damped Newton's method among many others. By simply adding DLS noise (i.e., evolutionarily faithful genetic drift), these algorithms become scientifically valid in silico simulations of Darwinian evolution. Finally, we demonstrate that even the state‑of‑the‑art Adam optimizer can be brought into evolutionary compliance through a minor mathematical surgery.

Authors:Ahmed Abdelmuniem Abdalla Mohammed
Title: Adaptive Computation Depth via Learned Token Routing in Transformers
Abstract:
Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token‑Selective Attention (TSA), a learned per‑token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two‑layer multi‑layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end‑to‑end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty‑proportional routing without any explicit depth pressure: even at λ=0 (no depth regularisation), the task‑loss gradient alone drives the router to skip 20% of token‑layer operations. On character‑level language modeling, TSA saved 14‑23% of token‑layer operations (TLOps) across Tiny‑Shakespeare and enwik8 at <0.5% quality loss. At matched efficiency, TSA achieved 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference‑time sparse execution for real wall‑clock speedup.

Authors:Yi Xie, Yangyang Xu, Yi Fan, Bo Liu
Title: SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
Abstract:
Large language models (LLMs) with a large number of parameters achieve strong performance but are often prohibitively expensive to deploy. Recent work explores using teams of smaller, more efficient LLMs that collectively match or even outperform a single large model. However, jointly updating multiple agents introduces compounding distribution shifts, making coordination and stability during training difficult. We address this by introducing Sequential Agent Tuning (SAT), a coordinator‑free training paradigm. SAT represents the team as a factorized policy and employs block‑coordinate updates over agents, enabling scalable, decentralized training without a central controller. Specifically, we develop a sequence‑aware, on‑policy advantage estimator that conditions on the evolving team policy, coupled with per‑agent KL trust regions that isolate occupancy drift. Theoretically, this framework provides two critical guarantees. First, it ensures monotonic improvement, stabilizing the training process. Second, it establishes provable plug‑and‑play invariance: any agent can be upgraded to a stronger model without retraining the rest of the team, with a formal guarantee that the performance bound improves. Empirically, a team of three 4B agents (12B total) trained with SAT surpasses the much larger Qwen3‑32B on AIME24/25 benchmarks by 3.9% on average. We validate our plug‑and‑play theory by swapping in two 8B agents, which boosts the composite score by 10.4%. We provide code and appendix of proof at https://github.com/Yydc/SAT‑AAMAS

Authors:Wilson Wu, Victor Lecomte, Michael Winer, George Robinson, Jacob Hilton, Paul Christiano
Title: Estimating the expected output of wide random MLPs more efficiently than sampling
Abstract:
By far the most common way to estimate an expected loss in machine learning is to draw samples, compute the loss on each one, and take the empirical average. However, sampling is not necessarily optimal. Given an MLP at initialization, we show how to estimate its expected output over Gaussian inputs without running samples through the network at all. Instead, we produce approximate representations of the distributions of activations at each layer, leveraging tools such as cumulants and Hermite expansions. We show both theoretically and empirically that for sufficiently wide networks, our estimator achieves a target mean squared error using substantially fewer FLOPs than Monte Carlo sampling. We find moreover that our methods perform particularly well at estimating the probabilities of rare events, and additionally demonstrate how they can be used for model training. Together, these findings suggest a path to producing models with a greatly reduced probability of catastrophic tail risks.

Authors:Jingsen Zhu, Silvia Sellán, Alexander Terenin
Title: A Bayesian Approach for Task-Specific Next-Best-View Selection with Uncertain Geometry
Abstract:
We develop a framework for task‑specific active next‑best‑view selection in 3D reconstruction from point clouds, by casting the problem in the language of Bayesian decision theory. Our framework works by (a) placing a prior distribution over the space of implicit surfaces, (b) using recently‑developed stochastic surface reconstruction methods to calculate the resulting posterior distribution, then (c) using the posterior distribution to carefully reason about which view to scan next. This enables us to perform camera selection in a manner that is directly optimized for the intended use of the reconstructed data ‑ meaning, we reduce uncertainty only in those regions that make a difference in the task at hand, as opposed to prior approaches that reduce it uniformly across space. We evaluate our method across three distinct downstream tasks: semantic classification, segmentation, and PDE‑guided physics simulation. Experimental results demonstrate that our framework achieves superior task performance with fewer views compared to commonly used baselines and prior general uncertainty‑reduction techniques.

Authors:Vasilis Perifanis, Foteini Nikolaidou, Nikolaos Pavlidis, Panagiotis Thomakos, Andreas Sendros
Title: Federated Learning for Early Prediction of EV Charging Demand
Abstract:
Accurate forecasting of electric vehicle (EV) charging demand is critical for grid stability, infrastructure planning, and real‑time charging optimization. In this work, we study the problem of early prediction of charging demand, where the total energy of a session is estimated using only information available at plug‑in time and during the first minutes of charging. This enables actionable decisions while the session is still in progress, which is of direct importance for EV network operators. We construct a session‑level dataset from the Adaptive Charging Network (ACN), combining session metadata with early‑window charging measurements, and derive tabular features capturing user intent, temporal patterns, and initial charging behavior. We focus on a single operational depot, Caltech, and model intra‑depot heterogeneity through station‑level client partitions while evaluating multiple model families in a federated learning (FL) setting. Our results show that federated models can approach centralized predictive performance while keeping data in‑depot, enabling privacy‑enhanced training across distributed charging infrastructures. Overall, we demonstrate that reliable demand estimates can be obtained early in the session with minimal data, and that FL provides a practical pathway toward scalable and privacy‑aware analytics for EV charging networks. Code is available at https://github.com/Indigma‑Innovations/federated‑learning‑ev‑charging‑demand.

Authors:Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao, Sam Tak Wu Kwong, Yuguang Fang
Title: Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
Abstract:
Long‑horizon LLM agents depend on intermediate information‑gathering turns, yet training feedback is usually observed only at the final answer, because process‑level rewards require high‑quality human annotation. Existing turn‑level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task‑specific verifiers. Conversely, label‑free RL methods extract self‑signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self‑Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential‑based turn‑level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability‑aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster‑level approximation. The objective generalizes information‑potential shaping from gold‑answer supervision to settings without task‑specific gold verifiers while avoiding the broadcasted rollout‑level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold‑answer limit, and show that SIOP improves average performance over verifier‑free outcome‑level baselines on seven search‑augmented agentic reasoning benchmarks while approaching a gold‑supervised outcome baseline. Code is available at https://github.com/dl‑m9/SIOP.git.

Authors:Han Wang, Jintao Zhang, Kai Jiang, Haoxu Wang, Jianfei Chen, Jun Zhu
Title: KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
Abstract:
LLM‑based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench‑X, a benchmark designed to answer this question through category‑aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from 1.58× to 1.44×; newly rescued kernels consistently underperform persistently correct ones (1.16× vs 1.58× speedup in round~0\to1). Third, correctness does not imply efficiency. 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross‑hardware speedup variance reaches 21.4×. Besides, quantization remains completely unsolved (0/30 successes) despite non‑trivial compilation rates, revealing systematic misunderstanding of numerical computation contracts rather than surface‑level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX

Authors:Weibin Gu, Chen Yang, Lu Shi
Title: Koopman Identification of Nonlinear Systems via Reservoir Liftings
Abstract:
Learning tractable linear representations of nonlinear dynamical systems via Koopman operator theory is often hindered by dictionary selection, temporal memory encoding, and numerical ill‑conditioning. Inspired by Reservoir Computing (RC) paradigm, this paper introduces the RC‑Koopman framework, which interprets reservoir as a stateful, finite‑dimensional Koopman dictionary whose temporal depth is explicitly controlled by its spectral radius. We show that the Echo State Property (ESP) guarantees well‑posedness and favorable numerical conditioning of the lifted Koopman approximation. A correlation‑based spectral radius selection algorithm aligns reservoir memory with dominant system timescales. Analysis reveals how the finite memory of the reservoir determines which Koopman eigenfunctions remain observable from the lifted features. Evaluation on synthetic benchmarks demonstrates that RC‑Koopman achieves a favorable balance between reconstruction accuracy of the underlying nonlinear dynamics and dynamical stability, compared to Extended Dynamic Mode Decomposition (EDMD) and Hankel‑based lifting approaches. Code available at: https://github.com/NEAR‑the‑future/RC‑Koopman.git

Authors:Yin Jun Phua
Title: A Foundation Model for Zero-Shot Logical Rule Induction
Abstract:
Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero‑shot rule induction. Rather than encoding literal identities, NRI represents literals using domain‑agnostic statistical properties such as class‑conditional rates, entropy, and co‑occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot‑based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T‑norm relaxation makes rule execution differentiable, allowing end‑to‑end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero‑shot transfer to real‑world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at https://github.com/phuayj/neural‑rule‑inducer.

Authors:Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang, Junhao Su
Title: Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
Abstract:
LLM post‑training typically propagates task gradients through the full depth of the model. Although this end‑to‑end structure is simple and general, it couples task adaptation to full‑depth activation storage, long‑range backward dependencies and direct task‑gradient access to pretrained representations. We argue that this full‑depth backward coupling can be unnecessarily expensive and intrusive, particularly when post‑training supervision is much narrower than pre‑training. To this end, we propose LoPT: Local‑Learning Post‑Training, a simple post‑training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second‑half block learns from the task objective, while the first‑half block is updated by a lightweight feature‑reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task‑induced backward path while limiting direct interference from narrow task gradients on early‑layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT

Authors:Mohamed Elhabebe, Ayman El-Baz, Qing Liu
Title: FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection
Abstract:
Automated glaucoma detection is critical for preventing irreversible vision loss and reducing the burden on healthcare systems. However, ensuring fairness across diverse patient populations remains a significant challenge. In this paper, we propose FairEnc, a fair pretraining method for vision‑language models (VLMs) that enables simultaneous debiasing across multiple sensitive attributes. FairEnc jointly mitigates biases in both textual and visual modalities with respect to multiple sensitive attributes, including race, gender, ethnicity, and language. Specifically, for the textual encoder, we leverage a large language model to generate synthetic clinical descriptions with varied sensitive attributes while preserving disease semantics, and employ a contrastive alignment objective to encourage demographic‑invariant representations. For the visual encoder, we propose a dual‑level fairness strategy that combines mutual information regularization to reduce statistical dependence between learned features and demographic groups, with multi‑discriminator adversarial debiasing. Comprehensive experiments on the publicly available Harvard‑FairVLMed dataset demonstrate that FairEnc effectively reduces demographic disparity as measured by DPD and DEOdds while achieving strong diagnostic performance under both zero‑shot and linear probing evaluations. Additional experiments on the private FairFundus dataset show that FairEnc consistently preserves fairness advantages under cross‑domain and cross‑modality settings and maintains diagnostic performance within a competitive range. These results highlight FairEnc's ability to generalize fairness under distribution shifts, supporting its potential for more equitable deployment in real‑world clinical settings. Our codebase and synthetic clinical notes are available at https://github.com/Mohamed‑Elhabebe/FairEnc

Authors:Shereen Elsayed, Ngoc Son Le, Ahmed Rashed, Lars Schmidt-Thieme
Title: Rethinking Convolutional Networks for Attribute-Aware Sequential Recommendation
Abstract:
Attribute‑aware sequential recommendation entails predicting the next item a user will interact with based on a chronologically ordered history of past interactions, enriched with item attributes. Existing methods typically leverage self‑attention mechanisms to aggregate the entire sequence into a unified representation used for next‑item prediction. While effective, these models often suffer from high computational complexity and memory consumption, limiting their ability to process long user histories. This constraint restricts the model's capacity to fully capture long‑term user preferences. In some scenarios, modeling item interactions purely through attention may also not be the most effective approach to extract sequential patterns. In this work, we propose ConvRec, an alternative method with linear computational and memory complexity that employs convolutional layers in a hierarchical, down‑scaled fashion to generate compact, yet expressive sequence representations. To further enhance the model's ability to capture diverse sequential patterns, each layer aggregates the neighboring items gradually to reach a comprehensive sequence representation. Extensive experiments on four real‑world datasets demonstrate that our approach outperforms state‑of‑the‑art sequential recommendation models, highlighting the potential of convolution‑based architectures for efficient and effective sequence modeling in recommendation systems. Our implementation code and datasets are available here https://github.com/ismll‑research/ConvRec.

Authors:Guangsheng Bao, Hongbo Zhang, Han Cui, Yanbin Zhao, Yue Zhang
Title: FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
Abstract:
Adapting pretrained models typically involves a trade‑off between the high training costs of backpropagation and the heavy inference overhead of memory‑based or in‑context learning. We propose FAAST, a forward‑only associative adaptation method that analytically compiles labeled examples into fast weights in a single pass. By eliminating memory or context dependence, FAAST achieves constant‑time inference and decouples task adaptation from pretrained representation. Across image classification and language modeling benchmarks, FAAST matches or exceeds backprop‑based adaptation while reducing adaptation time by over 90% and is competitive to memory/context‑based adaptation while saving memory usage by up to 95%. These results demonstrate FAAST as a highly efficient, scalable solution for supervised task adaptation, particularly for resource‑constrained models. We release the code and models at https://github.com/baoguangsheng/faast.

Authors:Ivan Bondarenko, Roman Derunets, Oleg Sedukhin, Mikhail Komarov, Ivan Chernov, Mikhail Kulakov
Title: RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
Abstract:
We present our winning system for Task~B (generation with reference passages) in SemEval‑2026 Task~8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT‑4o‑mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt‑oss‑120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno‑Lite‑0.1, a 7B domain‑adapted model with a strong cost‑‑performance trade‑off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval

Authors:Jingtao Zhou, Xirui Kang, Feiyang Huang, Lai-Man Po
Title: SpecPL: Disentangling Spectral Granularity for Prompt Learning
Abstract:
Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on frozen visual encoder as holistic extractor and neglecting the spectral granularity essential for fine‑grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low‑frequency bands and granular high‑frequency details. A frozen Visual Semantic Bank anchors text representations to universal low‑frequency invariants, mitigating overfitting. Crucially, fine‑grained discrimination is driven by counterfactual granule training: by permuting high‑frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug‑and‑play booster, revitalizing text‑oriented baselines like CoOp and MaPLe via visual‑side guidance. Experiments on 11 benchmarks demonstrate competitive state‑of‑the‑art performance, achieving a new performance ceiling of 81.51% harmonic‑mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability‑generalization trade‑off. Code is released at https://github.com/Mlrac1e/SpecPL‑Prompt‑Learning.

Authors:Keyu Chen, Nanfei Ye, Yida Wang, Wenchao Sun, Danqi Zhao, Hao Cheng, Sifa Zheng
Title: CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
Abstract:
Open‑loop imitation learning has advanced modern autonomous driving policy architectures, but closed‑loop deployment remains vulnerable to policy‑induced distribution shift. Existing post‑training paradigms exhibit fundamental trade‑offs: closed‑loop RL fine‑tuning provides grounded feedback from executed actions but is constrained by the sparsity of informative events, whereas counterfactual fine‑tuning provides dense supervision over candidate futures but inherits bias from imperfect future estimates. We introduce Counterfactual‑to‑Interactive Reinforcement Fine‑Tuning (CRAFT), an on‑policy framework that formulates closed‑loop post‑training as proxy‑residual optimization. CRAFT uses group‑normalized counterfactual advantages as a dense proxy for real closed‑loop advantages and aligns this proxy with the closed‑loop world through grounded residual correction from interaction‑critical events. To stabilize adaptation, CRAFT regularizes the online policy toward an EMA teacher via asymmetric KL self‑distillation. Theoretically, CRAFT decomposes the real closed‑loop policy gradient into proxy and residual terms under the same visited‑state distribution, reducing residual variance with an aligned proxy while mitigating proxy bias through grounded residual approximation. Empirically, CRAFT achieves the strongest closed‑loop gains on Bench2Drive across hierarchical planning, vision‑language‑action, and vocabulary‑scoring architectures. Ablations, scaling behavior, stability analyses, and transfer results further validate the complementary roles of dense counterfactual proxy and grounded residual correction. Project page: https://currychen77.github.io/CRAFT.

Authors:Fatima Ashraf, Muhammad Ayub Sabir, Junbiao Pang, Yufang Zhou, Yan Shang
Title: Discovering Sparse Counterfactual Factors via Latent Adjustment for Survey-based Community Intervention
Abstract:
Transportation surveys are widely used to understand travel preferences and adoption barriers, yet most survey‑based analyses remain descriptive or predictive and rarely provide sparse, policy‑feasible intervention strategies. We study sparse counterfactual community intervention from survey responses, where the goal is to shift a target respondent group toward a desired reference group through controllable survey‑variable adjustments. We formulate this task as a policy‑feasible distributional alignment problem using a fixed‑basis nonnegative latent representation that preserves pre/post comparability and provides a stable map from latent factors to original variables. To make latent movement actionable, target‑relevant latent factors are identified through Shapley‑guided attribution and transferred to controllable variables as intervention priorities. Feasible group‑level adjustments are then learned by minimizing an entropy‑regularized optimal‑transport discrepancy between the post‑intervention target distribution and the reference distribution, together with a weighted \ell_2,1 penalty that promotes shared policy‑lever sparsity. Experiments on real‑world transportation survey datasets show that the proposed framework produces compact and interpretable policy‑feasible interventions with explicit adjustment magnitudes, improves population‑level conversion, and preserves intervention sparsity. Code and datasets are publicly available at: https://github.com/pangjunbiao/latent‑group‑alignment.git

Authors:Seunghan Lee, Jaehoon Lee, Jun Seo, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Title: Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment
Abstract:
TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in‑context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test‑time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN‑T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN‑based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.

Authors:Nicholas J. Cooper, François G. Meyer, Michael L. Roberts, Carlos Zapata-Carratalá, Lijun Chen, Danna Gurari
Title: On the Architectural Complexity of Neural Networks
Abstract:
We introduce a unified theoretical framework for the rigorous analysis and systematic construction of deep neural networks (DNNs). This framework addresses a gap in existing theory by explicitly modeling the structure of tensor operations ‑‑ lower level information that is often abstracted. Our framework enables two novel objectives: (1) analysis of the evolution of architectural complexity over deep learning history, and (2) automatic construction of novel architectures based on new types of tensor operations. Our study of DNNs introduced over the past 40 years reveals a connection between groundbreaking architectures and increases in different types of architectural complexity. Moreover, we identify several large classes of higher complexity architectures that have not yet been explored. We then collect a dataset of 3,000+ higher complexity architectures, which we publicly release at: https://github.com/combinatoriallabs/ArchitecturalComplexity.

Authors:Etienne Gauthier, Francis Bach, Michael I. Jordan
Title: Explaining and Preventing Alignment Collapse in Iterative RLHF
Abstract:
Reinforcement learning from human feedback (RLHF) typically assumes a static or non‑strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy's true optimization gradient into a standard policy gradient and a parameter‑steering term that captures the policy's influence on the RM's future parameters. We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM's blind spots, producing low‑quality, high‑reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism‑design intervention that restores the missing steering term by regularizing the policy's parameter‑steering effect on RM updates. We instantiate FPO via a scalable first‑order approximation and demonstrate that it prevents alignment collapse on both controlled environments and an LLM alignment pipeline using Llama‑3.2‑1B.

Authors:Stephen Price, Kyle Miller, Marco Musto, Kenneth Kroenlein, James Saal, Kyle Tsaknopoulos, Elke A. Rundensteiner, Danielle L. Cote
Title: HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray
Abstract:
Cold spraying is an increasingly common approach for repairing and manufacturing components due to its solid‑state manufacturing capabilities. However, process optimization remains difficult due to many interdependent parameters and the lack of large‑scale, machine‑readable data to support modeling. While the scientific literature contains many relevant experiments, results are inconsistently reported (often in tables and figures) and use non‑uniform units, limiting utilization at scale. To address these limitations, this work presents HUGO‑CS, a literature‑derived dataset of 4,383 cold‑spray experiments with 144 features from 1,124 sources, exceeding the previous largest dataset (137 samples) by 30x. With completely manual extraction requiring an average of 91 minutes per document, this work designs and leverages a Hybrid‑labeled, Uncertainty‑aware, General‑purpose, Observational extraction framework, called HUGO, to support this extraction. HUGO combines automated LLM‑based labeling with targeted manual label refinement to handle this experimental result extraction process from scientific literature. To balance labeling efficiency with extraction accuracy, HUGO introduces a Hierarchical Risk Mitigation (HRM) to route LLM outputs with a high risk of potential errors for manual review, while retaining low‑risk records as auto‑labeled. Lastly, HUGO post‑processing consolidates categorical descriptors, maps reported feedstock chemistries into structured continuous compositions, and normalizes units across sources. Of the 4,383 reported experiments, 1,765 are hand‑labeled, providing a high‑quality labeled subset for benchmarking, error analysis, and higher‑fidelity data points. All code to replicate this work, along with the complete HUGO‑CS dataset, are released under a CC‑BY license at https://github.com/sprice134/HUGO.

Authors:Yaobo Zhang
Title: Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks
Abstract:
Relative positional encodings determine which functions of query‑key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group‑theoretic views of linear translation‑invariant positional encodings, we study a non‑semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory‑polynomial features such as e^‑γd\cos(ωd), e^‑γd\sin(ωd), d e^‑γd\cos(ωd), and d e^‑γd\sin(ωd), for causal lag d=i‑j\geq 0. Thus the construction realizes a distance‑modulated phase basis d e^iωd, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan‑RoPE as a non‑semisimple one‑parameter representation, give its real block form, and specify the contragredient query action required by non‑orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel‑level diagnostics and a Jordan‑friendly synthetic language‑model task show that the coupled Jordan basis is useful when the target contains distance‑modulated phase interactions. On a small WikiText‑103 byte language model, a scaled‑exact variant improves over RoPE and direct‑sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.

Authors:Nicolas Michel, Maorong Wang, Jiangpeng He, Toshihiko Yamasaki
Title: Continual Distillation of Teachers from Different Domains
Abstract:
Deep learning models continue to scale, with some requiring more storage than many large‑scale datasets. Thus, we introduce a new paradigm: Continual Distillation (CD), where a student learns sequentially from a stream of teacher models without retaining access to earlier teachers. CD faces two challenges: teacher training data is unavailable, and teachers have varying expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire information from domains not present in the training data, while known to the teacher. We also show that sequential distillation causes Unseen Knowledge Forgetting (UKF) when transferred knowledge is lost after training on later teachers. To better trade off between UKT and UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D reduces UKF and improves cross‑domain generalization. The code and implementation for this work are publicly available at: https://github.com/Nicolas1203/continual_distillation.

Authors:Evangelos Ntavelis, Sean Wu, Mohamad Shahbazi, Fabio Maninchedda, Dmitry Kostiaev, Artem Sevastopolsky, Vittorio Megaro, Trevor Phillips, Alejandro Blumentals, Shridhar Ravikumar, Mehak Gupta, Reinhard Knothe, Jeronimo Bayer, Matthias Vestner, Simon Schaefer, Thomas Etterlin, Christian Zimmermann, Mathias Deschler, Peter Kaufmann, Stefan Brugger, Sebastian Martin, Brian Amberg, Tom Runia
Title: Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
Abstract:
We propose HeadsUp, a scalable feed‑forward method for reconstructing high‑quality 3D Gaussian heads from large‑scale multi‑camera setups. Our method employs an efficient encoder‑decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV‑parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high‑resolution input views. We train and evaluate our model on an internal dataset with more than 10,000 subjects, which is an order of magnitude larger than existing multi‑view human head datasets. HeadsUp achieves state‑of‑the‑art reconstruction quality and generalizes to novel identities without test‑time optimization. We extensively analyze the scaling behavior of our model across identities, views, and model capacity, revealing practical insights for quality‑compute trade‑offs. Finally, we highlight the strength of our latent space by showcasing two downstream applications: generating novel 3D identities and animating the 3D heads with expression blendshapes.

Authors:Skye Gunasekaran, Téa Wright, Rui-Jie Zhu, Jason Eshraghian
Title: Transformers with Selective Access to Early Representations
Abstract:
Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low‑level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add static value residuals: learned mixing coefficients that expose the first‑layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer‑grained access, but at higher memory cost and lower throughput. The usefulness of V_1 is unlikely to be constant across tokens, heads, and contexts; different positions plausibly require different amounts of access to early lexical or semantic information. We therefore treat early‑representation reuse as a retrieval problem rather than a connectivity problem, and introduce Selective Access Transformer (SATFormer), which preserves the first‑layer value pathway while controlling access with a context‑dependent gate. Across models from 130M to 1.3B parameters, SATFormer consistently improves validation loss and zero‑shot accuracy over the static value‑residual and Transformer baselines. Its strongest gains appear on retrieval‑intensive benchmarks, where it improves over static value residuals by approximately 1.5 average points, while maintaining throughput and memory usage close to the baseline Transformer. Gate analyses suggest sparse, depth‑dependent, head‑specific, and category‑sensitive access patterns, supporting the interpretation that SATFormer learns selective reuse of early representations rather than uniform residual copying. Our code is available at https://github.com/SkyeGunasekaran/SATFormer.

Authors:Xun Jiang, Yufan Gu, Disen Hu, Yuqing Hou, Yazhou Yao, Fumin Shen, Heng Tao Shen, Xing Xu
Title: Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
Abstract:
Multimodal learning often grapples with the challenge of low‑quality data, which predominantly manifests as two facets: modality imbalance and noisy corruption. While these issues are often studied in isolation, we argue that they share a common root in the predictive uncertainty towards the reliability of individual modalities and instances during learning. In this paper, we propose a unified framework, termed Conformal Predictive Self‑Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self‑guided calibration on‑the‑fly. The core of our proposed CPSC lies in a novel self‑calibrating training loop that seamlessly integrates two key modules: (1) Representation Self‑Calibration, which decomposes unimodal features into components, and selectively fuses the most robust ones identified by a conformal predictor to enhance feature resilience. (2) Gradient Self‑Calibration, which recalibrates the gradient flow during backpropagation based on instance‑wise reliability scores, steering the optimization towards more trustworthy directions. Furthermore, we also devise a self‑update strategy for the conformal predictor to ensure the entire system co‑evolves consistently throughout the training process. Extensive experiments on six benchmark datasets under both imbalanced and noisy settings demonstrate that our CPSC framework consistently outperforms existing state‑of‑the‑art methods. Our code is available at https://github.com/XunCHN/CPSC.

Authors:Valery Manokhin
Title: The Manokhin Probability Matrix: A Diagnostic Framework for Classifier Probability Quality
Abstract:
The Brier score conflates two distinct properties of probabilistic predictions: reliability (calibration error) and resolution (discriminatory power). We introduce the Manokhin Probability Matrix, a BCG‑style two‑dimensional diagnostic framework that separates them. Classifiers are placed on a 2x2 grid by Spiegelhalter Z‑statistic and AUC‑ROC expected rank, then assigned to one of four archetypes: Eagle (good on both axes), Bull (strong discrimination, poor calibration), Sloth (well‑calibrated, weak discriminator), and Mole (poor on both). Each archetype carries a distinct prescription. We populate the matrix from a large‑scale empirical study spanning 21 classifiers, 5 post‑hoc calibrators, and 30 real‑world binary classification tasks from the TabArena‑v0.1 suite. The assignment is unambiguous. CatBoost, TabICL, EBM, TabPFN, GBC, and Random Forest are Eagles. XGBoost, LightGBM, and HGB are Bulls; Venn‑Abers calibration cuts log‑loss by 6.5 to 12.6% on Bulls but degrades Eagles by 2.1%. SVM, LR, LDA, and the empirical base‑rate predictor are Sloths. MLP, KNN, Naive Bayes, and ExtraTrees are Moles. A theoretical asymmetry follows: no order‑preserving post‑hoc calibrator can add discriminatory power (Proposition 1), so calibration is the fixable part and discrimination is the hard part. The practical rule is direct: do not optimise aggregate Brier score without first decomposing it; optimise discrimination first, then fix calibration post‑hoc. Code and raw experimental data are available at https://github.com/valeman/classifier_calibration.

Authors:Faraz Kayani, Sarmad Kayani, Asad Ahmed, Radu Timofte, Dmitry Ignatov
Title: Real Image Denoising with Knowledge Distillation for High-Performance Mobile NPUs
Abstract:
While deep‑learning‑based image restoration has achieved unprecedented fidelity, deployment on mobile Neural Processing Units (NPUs) remains bottlenecked by operator incompatibility and memory‑access overhead. We propose an NPU‑aware hardware‑algorithm co‑design approach for real‑world image denoising on mobile NPUs. Our approach employs a high‑capacity teacher to supervise a lightweight student network specifically designed to leverage the tiled‑memory architectures of modern mobile SoCs. By prioritizing NPU‑native primitives ‑‑ standard 3x3 convolutions, ReLU activations, and nearest‑neighbor upsampling ‑‑ and employing a progressive context expansion strategy (up to 1024x1024 crops), the model achieves 37.66 dB PSNR / 0.9278 SSIM on the validation benchmark and 37.58 dB PSNR / 0.9098 SSIM on the held‑out test benchmark at full resolution (2432x3200) in the Mobile AI 2026 challenge. Following the official challenge rules, the inference runtime is measured under a standardized Full HD (1088x1920) protocol, where it runs in 34.0 ms on the MediaTek Dimensity 9500 and 46.1 ms on the Qualcomm Snapdragon 8 Elite NPU. We further reveal an "Inference Inversion" effect, where strict adherence to NPU‑compatible operations enables dedicated NPU execution up to 3.88x faster than the integrated mobile GPU. The 1.96M‑parameter student recovers 99.8% of the teacher's restoration quality via high‑alpha knowledge distillation (alpha = 0.9), achieving a 21.2x parameter reduction while closing the PSNR gap from 1.63 dB to only 0.05 dB. These results establish hardware‑aware distillation as an effective strategy for unifying high‑fidelity denoising with practical deployment across diverse mobile NPU architectures. The proposed lightweight student model (LiteDenoiseNet) and its training statistics are provided in the NN Dataset, available at https://github.com/ABrain‑One/NN‑Dataset.

Authors:Raphaël Le Bidan, Ahmad Ismail, Elsa Dupraz, Charbel Abdel Nour
Title: Leveraging Code Automorphisms for Improved Syndrome-Based Neural Decoding
Abstract:
Syndrome‑based neural decoding (SBND) has emerged as a promising deep learning approach for soft‑decision decoding of high‑rate, short‑length codes. However, this approach still has substantial room for improvement. In this paper, we show how to leverage code automorphisms to enhance the ability of existing SBND models to learn and generalize through data augmentation during training and inference. As a result, for the short high‑rate codes considered, we obtain models that closely approach MLD performance using small datasets and proper training. Our findings also suggest that many prior results for SBND models in the literature underestimate their true correction capability due to undertraining. Code to reproduce all results is available at: https://github.com/lebidan/sbnd.

Authors:Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn
Title: FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models
Abstract:
Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail on financial domain, which exhibit unique characteristics. We propose a general 2x2 capability taxonomy for TSRMs by crossing 1) single‑entity vs. multi‑entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain ‑‑ where the distinction between deterministic assessment and stochastic prediction is particularly critical ‑‑ as ten financial reasoning tasks, forming the FinTSR‑Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR‑Bench with distinct chain‑of‑thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute‑in‑CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario‑Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR‑Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario‑Aware CoT consistently improves prediction accuracy over standard CoT. Code is publicly available at: https://github.com/seunghan96/FinSTaR.

Authors:Al Zadid Sultan Bin Habib, Gianfranco Doretto, Donald A. Adjeroh
Title: DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data
Abstract:
High‑dimensional tabular data lacks a natural feature order, limiting the applicability of permutation‑sensitive deep learning models. We propose DynaTab, a dynamic feature ordering‑enabled architecture inspired by neural rewiring. We introduce a lightweight criterion that predicts when feature permutation will benefit a dataset by quantifying its intrinsic complexity. DynaTab dynamically reorders features via a neural rewiring algorithm and processes them through a compact, dynamic order‑aware combination of separate learned positional embedding, importance‑based gating, and masked attention layers, compatible with any sequence‑sensitive backbone. Trained end‑to‑end with bespoke dynamic feature ordering (DFO) and dispersion losses, DynaTab achieves statistically significant gains, particularly on high‑dimensional datasets, where it is benchmarked against 45 state‑of‑the‑art baselines across 36 different real‑world tabular datasets. Our results position DynaTab as a compelling new paradigm for high‑dimensional tabular deep learning.

Authors:Akshat Singh Jaswal, Ashish Baghel, Paras Chopra
Title: Discovering Reinforcement Learning Interfaces with Large Language Models
Abstract:
Reinforcement learning systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (Code available at https://github.com/Lossfunk/LIMEN), a LLM guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory‑level success metric, while optimizing either component alone fails on at least one domain. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering and that observation and reward components often benefit from co‑design, as single‑component optimization fails catastrophically on at least one domain in our evaluation suite.

Authors:Alexander Matyasko, Xin Lou, Indriyati Atmosukarto, Wei Zhang
Title: TsallisPGD: Adaptive Gradient Weighting for Adversarial Attacks on Semantic Segmentation
Abstract:
Attacking semantic segmentation models is significantly harder than image classification models because an attacker must flip thousands of pixel predictions simultaneously. Standard pixel‑wise cross‑entropy (CE) is ill‑suited to this setting: it tends to overemphasize already‑misclassified pixels, which slows optimization and overstates model robustness. To address these issues, we introduce TsallisPGD, an adversarial attack built on the Tsallis cross‑entropy, a generalization of CE parameterized by q, which adaptively reshapes the gradient landscape by controlling gradient concentration across pixels. By varying q, we steer the attack toward pixels at different confidence levels. We first show that no single fixed‑q is universally optimal, as its effectiveness depends on the dataset, model architecture, and perturbation budget. Motivated by this, we propose a dynamic q‑schedule that sweeps q during optimization. Extensive experiments on Cityscapes, Pascal VOC, and ADE20K show that TsallisPGD, using a single validation‑selected schedule, achieves the best average attack rank across all evaluated settings and improves over CEPGD, SegPGD, CosPGD, JSPGD, and MaskedPGD in reducing accuracy and mIoU on both standard and robust models.

Authors:Jonathan Muhire
Title: Donor-Aware scRNA-seq Benchmarks for IBD Classification
Abstract:
Donor‑level disease classification from single‑cell RNA sequencing (scRNA‑seq) requires strict donor‑aware cross‑validation: naive pipelines that split cells randomly conflate training and test donors, inflating reported performance through pseudoreplication. We present a donor‑aware benchmark evaluating three feature representations across two independent IBD cohorts: centered log‑ratio (CLR) transformed cell‑type composition, GatedStructuralCFN dependency embeddings, and scVI variational autoencoder latent embeddings. The cohorts are the SCP259 ulcerative colitis atlas (UC vs. Healthy, n=30 donors, 51 cell types) and the Kong 2023 Crohn's disease atlas (CD vs. Healthy, n=71 donors, 55‑68 cell types across three intestinal regions). Compartment‑stratified CLR composition achieves AUROC 0.956 +/‑ 0.061 on SCP259; GatedStructuralCFN on the same features achieves 0.978 +/‑ 0.050. In the Kong cohort, CFN achieves its best performance in the colon region (0.960 +/‑ 0.055 after feature filtering), exceeding linear CLR (0.900 +/‑ 0.100), while terminal ileum classification is dominated by linear models (CatBoost CLR 0.967 +/‑ 0.075 vs. CFN 0.811 +/‑ 0.164). Cross‑dataset transfer (CD‑>UC, four shared cell types) achieves AUC 0.833 with XGBoost CLR; the reverse direction performs at chance. CFN edge stability analysis shows that compartment‑wise composition eliminates spurious unit‑sum‑induced instability present in global composition (Jaccard 0.026 vs. top‑20 recurrence 1.0). CFN shows a consistent numerical advantage over linear models in the colon region of CD (AUROC 0.960 vs. 0.900), though no inter‑method comparison reached statistical significance at n<=34 donors per region. Compartment‑aware feature construction is critical for both classification performance and structural interpretability. Code: https://github.com/Jonathan‑321/sfn‑scrna‑study

Authors:Gabriel Garcia
Title: The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Abstract:
Large language models often fail at simple counting tasks, even when the items to count are explicitly present in the prompt. We investigate whether this failure occurs because transformers do not represent counts internally, or because they cannot convert those representations into the correct output tokens. Across three model families, Pythia, Qwen3, and Mistral, ranging from 0.4B to 14B parameters, we find strong evidence for the second explanation. Linear probes recover the correct count from intermediate layers with near‑perfect accuracy (R^2>0.99), showing that the information is present. However, the internal directions that encode counts are nearly orthogonal to the output‑head rows for digit tokens (|\cos|\leq0.032). In other words, the model stores the count in a form that the digit logits do not naturally read out. We localize this failure with two interventions. Updating only the digit rows of the output head (36,864 parameters) substantially improves constrained next‑token digit prediction (60.7 to 100.0% across four tasks), but it does not fix autoregressive generation. By contrast, a small LoRA intervention on attention Q/V weights (7.67M parameters) improves upstream routing and achieves 83.1% +/‑ 7.2% in true greedy autoregressive generation. Logit‑lens measurements confirm the mechanism: the correct digit's vocabulary rank drops from 55,980 to 1, a 50,000x improvement. Additional norm, logit‑lens, and cross‑task analyses show that the bottleneck generalizes across character counting, addition, and list length, while remaining absent from broader multi‑step reasoning benchmarks, including MMLU, GSM8K, and DROP. These results identify counting failure as a geometric readout bottleneck rather than a failure of internal representation: the model knows the count but the output pathway is geometrically misaligned with the tokens needed to express it.

Authors:Seunghyun Ji
Title: Ortho-Hydra: Orthogonalized Experts for DiT LoRA
Abstract:
LoRA fine‑tuning of diffusion transformers (DiT) on multi‑style data suffers from \emphstyle bleed: a single low‑rank residual cannot represent several distinct artist fingerprints, and the optimizer converges to their average. Mixture‑of‑experts LoRA in the HydraLoRA style replaces the up‑projection with E heads under a router, but when every expert is zero‑initialized the router receives identical gradient from each head and remains at the uniform prior. The experts then evolve permutation‑symmetrically, and the network trains as a single rank‑r LoRA at E× the cost. We present Ortho‑Hydra, a re‑parameterisation that combines an OFT‑style Cayley‑orthogonal shared basis with per‑expert \emphdisjoint output subspaces carved from the top‑(Er) left singular vectors of the pretrained weight. Disjointness makes the router's per‑expert score non‑degenerate at step~0, so specialization receives gradient signal before any expert has trained. We test the predicted deadlock on a DiT pipeline by comparing two HydraLoRA baselines, a zero‑initialized shared‑basis variant and the original σ=0.1 Gaussian‑jitter mitigation, against Ortho‑Hydra under a matched optimiser, dataset, and step budget. Neither baseline leaves the uniform prior within the first 1\textk steps; Ortho‑Hydra begins de‑uniformising within the first few hundred. End‑task generation quality on multi‑style data is out of scope; we report the construction, the cold‑start mechanism, and the routing dynamics it changes. Code: https://github.com/sorryhyun/anima_lora.

Authors:Adrian Grassi
Title: OCRR: A Benchmark for Online Correction Recovery under Distribution Shift
Abstract:
Static benchmarks measure a model frozen at training time. Real systems face distribution shift: new categories, paraphrased queries, drift: and must recover online via user corrections. No existing benchmark measures recovery speed under correction streams. We introduce OCRR (Online Correction Recovery Rate): a benchmark that streams a corpus through a classification system, applies oracle or stochastic corrections to wrong predictions, and reports two curves: novel‑class accuracy and original‑distribution accuracy versus correction count. We evaluate the substrate alongside nine baseline algorithms from five families plus seven bounded‑storage variants of the substrate for the Pareto sweep, including standard online‑learning baselines (river), continual‑learning methods (EWC, A‑GEM, LwF), retrieval/parametric hybrids (kNN‑LM), parameter‑efficient fine‑tuning of a 1.5 B‑parameter encoder (LoRA on DeBERTa‑v3‑large), and a hash‑chained append‑only substrate (Substrate). On Banking77 and CLINC150, under oracle and sparse correction policies, the substrate is the only system that simultaneously recovers novel‑class accuracy (88.7 +/‑ 2.9 %) and retains original‑distribution accuracy (95.4 +/‑ 0.8 %) beating the next‑best published continual‑learning baseline by 32.6 percentage points at equal memory budget, and beating LoRA‑on‑DeBERTa‑v3‑large by 84.6 percentage points on retention. We further find that classification accuracy remains stable at 99 % even as approximate‑nearest‑neighbour recall@5 degrades from 0.69 to 0.23 across 10 k to 10 M corpus scales, suggesting the substrate's margin‑band majority vote is robust to retrieval imperfection in a way that pure top‑k recall metrics do not predict. Code and data are available at https://github.com/adriangrassi/ocrr‑benchmark.

Authors:Muhammad Muneeb, David B. Ascher
Title: EFGPP: Exploratory framework for genotype-phenotype prediction
Abstract:
Predicting complex human traits from genetic data is challenging because different genetic, clinical, and molecular data sources often contain different parts of the signal. Here, we present EFGPP, a reproducible framework for generating, ranking, and combining multiple types of data for genotype‑to‑phenotype prediction. We applied EFGPP to migraine prediction using UK Biobank data from 733 individuals. The framework combined genotype‑derived features, principal components, clinical and metabolomic covariates, and polygenic risk scores generated from migraine and depression GWAS using PLINK, PRSice‑2, AnnoPred, and LDAK‑GWAS. The best single data type achieved a test AUC of 0.644, while combining multiple data types improved performance to 0.688 using migraine‑focused inputs and 0.663 using cross‑trait depression‑derived inputs. Genetic features alone did not outperform the covariates‑only baseline, but genotype‑derived features performed better than PRS alone, and depression‑derived PRS showed useful predictive signal. Overall, EFGPP provides a practical proof‑of‑concept framework for prioritising and integrating heterogeneous genetic data sources for complex phenotype prediction.

Authors:Fang Wu, Weihao Xuan, Heli Qi, Hanqun Cao, Heng-Jui Chang, Zeqi Zhou, Haokai Zhao, Ma Jian, Carl Ma, Yu-Chi Cheng, Kuan Pang, Xiangru Tang, Zehong Wang, Guanlue Li, Hanchen Wang, Kejun Ying, Pan Lu, Chiho Im, Seungju Han, Peng Xia, Tinson Xu, Yinxi Li, Deyao Zhu, Pheng-Ann Heng, Naoto Yokoya, Masashi Sugiyama, Li Erran Li, Jure Leskovec, Yejin Choi
Title: Proteo-R1: Reasoning Foundation Models for De Novo Protein Design
Abstract:
Deep learning in \emphde novo protein design has achieved atomic‑level fidelity. However, existing models remain largely non‑deliberative: they directly synthesize molecular geometries without explicitly reasoning about which residues or interactions are functionally essential. As a result, design decisions are entangled with continuous sampling dynamics, limiting interpretability, controllability, and systematic reuse of biochemical knowledge. We introduce Proteo‑R1, a reasoning‑guided protein design framework that explicitly decouples \emphmolecular understanding from \emphgeometric generation. Proteo‑R1 adopts a dual‑expert architecture in which a multimodal large language model (MLLM) serves as an \emphunderstanding expert, analyzing protein sequences, structures, and textual context to identify key functional residues that govern binding and specificity. These residue‑level decisions are then passed as hard constraints to a separate diffusion‑based \emphgeneration expert, which performs conditional co‑design while respecting the fixed interaction anchors. This factorization mirrors how human experts approach molecular engineering: first, reasoning about critical interactions, then optimizing geometry subject to those constraints. By operationalizing reasoning as explicit residue‑level commitments rather than latent textual guidance, Proteo‑R1 achieves stable, interpretable, and modular integration of LLM reasoning with state‑of‑the‑art geometric generative models. Code, data, and demos are available at https://smiles724.github.io/r1/.

Authors:Bumjun Kim, Albert No
Title: Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings
Abstract:
Understanding how textual embeddings contribute to memorization in text‑to‑image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <startoftext>, <prompt>, <endoftext> and <pad> with corresponding embeddings \mathbfv^\mathbfsot, \mathbfv^\mathbfpr, \mathbfv^\mathbfeot, \mathbfv^\mathbfpad. We discover that \mathbfv^\mathbfpr contribute minimally to generation in memorized cases. In contrast, \mathbfv^\mathbfpad strongly affect memorization due to their structural duplication of \mathbfv^\mathbfeot, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of \mathbfv^\mathbfeot, causing the model to over‑rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference‑time mitigation strategies: (1) Replacing the tokenizer's default <pad> from <eot> to the ! token before embedding, and masking the \mathbfv^\mathbfeot; (2) Partial masking of \mathbfv^\mathbfpad. Both suppress memorization without degrading quality, and are readily deployable without prior detection.

Authors:Shikhar Shukla
Title: SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
Abstract:
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length γ, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed γ (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present SpecKV, a lightweight adaptive controller that selects γ per speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step‑level records with per‑step acceptance rates, draft entropy, and draft confidence. We demonstrate that the optimal γ shifts across compression regimes and that draft model confidence and entropy are strong predictors of acceptance rate (correlation \approx 0.56). SpecKV uses a small MLP trained on these signals to maximize expected tokens per speculation step, achieving a 56.0% improvement over the fixed‑γ=4 baseline with only 0.34 ms overhead per decision (<0.5% of step time). The improvement is statistically significant (p < 0.001, paired bootstrap test). We release all profiling data, trained models, and notebooks as open‑source artifacts.

Authors:Kevin Riehl, Andres L. Marin, Nikofors Zacharof, Fan Wu, Patrick Langer, Robert Jakob, Anastasios Kouvelas, Georgios Fontaras, Michail A. Makridis
Title: ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
Abstract:
Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result‑generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content‑based scores for reproducibility assessments. Experiments on 213 ReScience C articles ‑ the largest cross‑domain benchmark of human‑validated computational reproducibility studies considered to date ‑ demonstrate ARA's generalizability and consistent workflow reconstruction and assessment across LLMs, model temperatures, and scientific domains. ARA achieves ~61% accuracy on three benchmarks, and the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and enabling next‑generation peer review. Code and Data available: https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.

Authors:Rahul Kumar
Title: The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
Abstract:
As frontier AI models are deployed in high‑stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6‑condition factorial design with dual‑classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all p < 2 × 10^‑8, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance‑forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near‑perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment‑specific training. We release the complete dataset and evaluation infrastructure.

Authors:Yan Jiang, Ruihong Qiu, Zi Huang
Title: Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
Abstract:
Recent diffusion large language models (dLLMs) have demonstrated both effectiveness and efficiency in reasoning via a block‑based semi‑autoregressive generation paradigm. Despite their progress, the fixed‑size block generations remain a critical bottleneck for effective and coherent reasoning. 1. From a global perspective, different reasoning tasks would correspond to different optimal decoding block sizes, which makes a ``one‑size‑fits‑all'' assumption ineffective. 2. Even within a single reasoning task, the rigid block partitioning would break the logical flow and reduce reasoning coherence. Through empirical observations, we reveal that for block‑wise entropy, incorrect reasoning exhibits a fluctuating and unsteady trend between blocks, whereas the correctly generated tasks follow a consistent descending trend. Therefore, this paper proposes b1, a novel post‑training framework for dLLMs that learns dynamic‑size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence.b1 integrates seamlessly as a plug‑and‑play module with existing dLLM's post‑training algorithms. Extensive experiments across various reasoning benchmarks showcase b1's consistent improvement over existing fixed‑size block baselines. Our code has been released at https://github.com/YanJiangJerry/Block‑R1.

Authors:Pawel Kaplanski
Title: Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
Abstract:
Recursive language‑model loops often settle into recognizable attractor‑like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30‑step recursive loops by separating the model from the context‑update rule: append, replace, and dialog updates expose different histories to the same generator. The main result is that persistent redirection in append‑mode recursive loops is memory‑policy‑conditioned. Under a 12,000‑character tail clip, destination‑coherent persistence plateaus near 16 percent and retained source‑basin escape near 36 percent at dose 400; neither crosses 50 percent. Under a full‑history protocol, retained source‑basin escape crosses 50 percent near 400 tokens and saturates at 75‑80 percent by 1,500 tokens; destination‑coherent persistence first reaches 0.50 near 1,500 tokens (Wilson 95 percent CI [0.41, 0.61]). A four‑step falsification battery (heterogeneity control, granularity sweep with hierarchical macro‑merge, transition‑entropy diagnostic, and long‑horizon trajectory continuation) recasts the high‑dose destination‑coherent dip as a finite‑horizon, endpoint‑definition‑sensitive feature rather than a stable structural asymmetry. Half the canonical magnitude is endpoint timing; the residual drops 73 percent from ‑0.143 at step 29 to ‑0.039 at step 79 under the frozen canonical cluster basis, bootstrap interval straddling zero. Replace‑mode raw switching is near‑saturated under the default protocol but largely reflects state‑reset overwrite: insert‑mode probes drop it to 12‑32 percent. We report 37 experiments on gpt‑4o‑mini with within‑vendor replication on gpt‑4.1‑nano. Recursive‑loop evaluations should distinguish transient movement from durable escape, subtract stochastic floors, and treat context‑update rules as safety‑relevant design choices.

Authors:Abdullah Ahmad Khan, Hamid Laga, Ferdous Sohel
Title: Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
Abstract:
Machine unlearning in Vision‑Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU‑Bench, UnLOK‑VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA‑1.5‑7B models reveals two opposing clusters, FA, RA, MIA and AD, JS, with tau_FA_AD = ‑0.26, reproduced on BLIP‑2 OPT‑2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image‑and‑text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = ‑0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 +‑ 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre‑computed results are available at https://github.com/neurips26/UnifiedUnl.

Authors:Jianing Zhang, Zijian Zhou, Kai Sun
Title: RAFNet: Region-Aware Fusion Network for Pansharpening
Abstract:
Pansharpening aims to generate high‑resolution multispectral (HRMS) images by fusing low‑resolution multispectral (LRMS) and high‑resolution panchromatic (PAN) images. Although deep learning has advanced this field, mainstream frequency‑based methods relying on standard scaled dot‑product attention suffer from quadratic computational complexity and fail to exploit the inherent regional sparsity of remote sensing imagery. Furthermore, existing spatial enhancement strategies typically employ static convolution kernels, which struggle to adapt to the complex frequency and regional variations of PAN and MS images. To address these bottlenecks, we propose a Region‑Aware Fusion (RAFNet) Network that synergistically models spatial and frequency information. Specifically, we design a Spatial Adaptive Refinement (SAR) module that leverages the discrete wavelet transform (DWT) for directional frequency separation and K‑means clustering for regional partitioning, which enables the dynamic construction of region‑specific adaptive convolution kernels, achieving spatially and frequency‑adaptive feature enhancement. Moreover, we introduce a Clustered Frequency Aggregation (CFA) module based on a sparse attention mechanism guided by the semantic clusters, which executes a region‑aware sparse attention strategy that drastically reduces computational redundancy while ensuring high‑quality frequency feature extraction. In addition we integrated these modules into a progressive, multi‑level spatial‑frequency network architecture to facilitate robust interaction and accurate image reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that the proposed RAFNet significantly outperforms state‑of‑the‑art pansharpening methods in both reduced‑ and full‑resolution assessments. The code is available at https://github.com/PatrickNod/RAFNet.

Authors:Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi
Title: Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution
Abstract:
Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating low‑gradient‑magnitude features, input‑space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose \emphManifold‑Aligned Guided Integrated Gradients (MA‑GIG), which constructs attribution paths in the latent space of a pre‑trained variational autoencoder. By decoding intermediate latent states, MA‑GIG biases the path toward the learned generative manifold and reduces exposure to implausible input‑space regions. Through qualitative and quantitative evaluations, we demonstrate that MA‑GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method reduces off‑manifold noise and outperforms prior path‑based attribution methods across multiple datasets and classifiers. Our code is available at https://github.com/leekwoon/ma‑gig/.

Authors:Vik Pant, Eric Yu
Title: Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition
Abstract:
We present Coopetition‑Gym v1, a benchmark platform for mixed‑motive multi‑agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed‑form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward‑type ablation, the platform's principal methodological apparatus. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung‑Sony LCD, Renault‑Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game‑theoretic oracles, 2 heuristic baselines, and 101 constant‑action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708‑run training corpus and a 1,116‑run behavioral audit corpus, both released under CC‑BY‑4.0 with Croissant 1.0 metadata. Coopetition‑Gym v1 is the first platform to combine continuous‑action mixed‑motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game‑theoretic oracle baselines, and validated case studies.

Authors:Daniel da Silva Costa, Pedro Nuno de Souza Moura, Adriana C. F. Alvim
Title: How Can One Choose the Best CAM-Based Explainability Method for a CNN Model?
Abstract:
In recent years, several advances have been observed in Deep Learning with surprising results. Models in this area have been increasingly used in numerous applications, including those sensitive to human life, which require clear explanations and justifications. Various explainability methods have been proposed, but not many metrics to evaluate these methods. The most commonly used metric is the Intersection over Union (IoU). However, due to the characteristics of the results of the explainability methods, called saliency maps, which do not have a known shape, we hypothesise that there must be a better metric that allows one to find an explainability method that produces results that best resemble the human perception. We propose using different metrics to assess the similarity between human perception and the explanation saliency maps to find a better metric. An investigation was conducted employing a subset of the Chihuahuas images from ImageNet dataset. Several CAM‑based explainability methods were used to generate saliency maps for each chihuahua image. Alignment was measured by applying distance metrics between the bounding box of human annotations and the saliency maps produced by each explainability method. Rankings of the best saliency maps were created using the results of the distance metrics and compared to the ranking obtained using people's choice, collected through crowdsourcing, of the best explanation saliency maps for each selected image. Comparison between rankings was performed using the Rank‑Biased Overlap (RBO) metric. The results indicate the feasibility of our method to find the explainability method that best resembles human perception. In our experiments, the two metrics that best resemble human perception corresponded to Manhattan and Correlation. Besides, the best explainability methods regarding human perception were LayerCAM, Score‑CAM, and IS‑CAM.

Authors:Luo Ji, Qi Qin, Ningyuan Xi, Teng Chen, Qingqing Gu, Hongyan Li
Title: Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
Abstract:
Conventional LLMs may suffer from corpus heterogeneity and subtle condition changes. While finetuning can create the catastrophe forgetting issue, application of meta‑learning on LLMs is also limited due to its complexity and scalability. In this paper, we activate the meta‑signal of β within the SwiGLU blocks, resulting in a meta‑gating mechanism that adaptively adjusts the nonlinearity of FFN. A hypernetwork is employed which dynamically produces β on textual conditions, providing meta‑controllability on LLMs. By testing on different condition types such as task, domain, persona, and style, our method outperforms finetuning and meta‑learning baselines, and can generalize reasonably on unseen tasks, condition types, or instructions. Our code can be found in https://github.com/AaronJi/MeGan.

Authors:Shengzhe Lyu, Yuhan She, Patrick S. Y. Hung, Ray C. C. Cheung, Weitao Xu
Title: ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA
Abstract:
Vision Mamba (ViM) models offer a compelling efficiency advantage over Transformers by leveraging the linear complexity of State Space Models (SSMs), yet efficiently deploying them on FPGAs remains challenging. Linear layers struggle with dynamic activation outliers that render static quantization ineffective, while uniform quantization fails to capture the weight distribution at low bit‑widths. Furthermore, while associative scan accelerates SSMs on GPUs, its memory access patterns are misaligned with the streaming dataflow required by FPGAs. To address these challenges, we present ViM‑Q, a scalable algorithm‑hardware co‑design for end‑to‑end ViM inference on the edge. We introduce a hardware‑aware quantization scheme combining dynamic per‑token activation quantization and per‑channel smoothing to mitigate outliers, alongside a custom 4‑bit per‑block Additive Power‑of‑Two (APoT) weight quantization. The models are deployed on a runtime‑parameterizable FPGA accelerator featuring a linear engine employing a Lookup‑Table (LUT) unit to replace multiplications with shift‑add operations, and a fine‑grained pipelined SSM engine that parallelizes the state dimension while preserving sequential recurrence. Crucially, the hardware supports runtime configuration, adapting to diverse dimensions and input resolutions across the ViM family. Implemented on an AMD ZCU102 FPGA, ViM‑Q achieves an average 4.96x speedup and 59.8x energy efficiency gain over a quantized NVIDIA RTX 3090 GPU baseline for low‑batch inference on ViM‑tiny. This co‑design shows a viable path for deploying ViM models on resource‑constrained edge devices.

Authors:Shengzhe Lyu, Yuhan She, Di Duan, Tao Ni, Yu Hin Chan, Chengwen Luo, Ray C. C. Cheung, Weitao Xu
Title: SwiftChannel: Algorithm-Hardware Co-Design for Deep Learning-Based 5G Channel Estimation
Abstract:
Channel estimation is crucial in 5G communication networks for optimizing transmission parameters and ensuring reliable, high‑speed communication. However, the use of multiple‑input and multiple‑output (MIMO) and millimeter‑wave (mmWave) in 5G networks presents challenges in achieving accurate estimation under strict latency requirements on resource‑limited hardware platforms. To address these challenges, we propose SwiftChannel, an algorithm‑hardware co‑design framework that integrates a hardware‑friendly deep learning‑based channel estimator with a dedicated accelerator. Our approach employs a convolutional neural network enhanced with a parameter‑free attention mechanism, which effectively reconstructs full‑resolution spatial‑frequency domain channel matrices from low‑resolution least squares (LS) estimates. We further develop a multi‑stage model compression pipeline combining knowledge distillation, convolution re‑parameterization, and quantization‑aware training, resulting in substantial model size reduction with negligible accuracy loss. The hardware accelerator, implementing the compressed model and the LS estimator on FPGA platforms using High‑level Synthesis (HLS), features a fine‑grained pipeline architecture and optimized dataflow strategies. Tested on a Zynq UltraScale+ RFSoC, the accelerator achieves sub‑millisecond latency, providing up to 24x speed‑up and over 33x improvement in energy efficiency compared to GPU‑based solutions. Extensive evaluations demonstrate that the proposed design generalizes not only across various noise levels and user mobilities, but also to a variety of unseen channel profiles, outperforming state‑of‑the‑art baselines. By unifying algorithmic innovation with hardware‑aware design, our work presents a future‑proof channel estimation solution for 5G MIMO systems.

Authors:An T. Le
Title: Training Non-Differentiable Networks via Optimal Transport
Abstract:
Neural networks increasingly embed non‑differentiable components (spiking neurons, quantized layers, discrete routing, blackbox simulators, etc.) where backpropagation is inapplicable and surrogate gradients introduce bias. We present PolyStep, a gradient‑free optimizer that updates parameters using only forward passes. Each step evaluates the loss at structured polytope vertices in a compressed subspace, computes softmax‑weighted assignments over the resulting cost matrix, and displaces particles toward low‑cost vertices via barycentric projection. This update corresponds to the one‑sided limit of a regularized optimal‑transport problem, inheriting its geometric structure without Sinkhorn iterations. PolyStep trains genuinely non‑differentiable models where existing gradient‑free methods collapse to near‑random accuracy. On hard‑LIF spiking networks we reach 93.4% test accuracy, outperforming all gradient‑free baselines by over 60~pp and closing to within 4.4~pp of a surrogate‑gradient Adam ceiling. Across four additional non‑differentiable architectures (int8 quantization, argmax attention, staircase activations, hard MoE routing) we lead every gradient‑free competitor. On MAX‑SAT scaling from 100 to 1M variables, we sustain above 92% clause satisfaction while evolution strategies drop 8‑‑12~pp. On RL policy search, we match OpenAI‑ES on classical control and retain performance under integer and binary quantization that collapses gradient‑based methods. We prove convergence to conservative‑stationary points at rate O(\log T/\sqrtT) on piecewise‑smooth losses, upgraded to Clarke‑stationary on the headline architectures and extended to the piecewise‑constant regime via a hitting‑time bound. These rates match the known zeroth‑order query‑complexity lower bounds that all forward‑only methods inherit. Code is available at https://github.com/anindex/polystep.

Authors:Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, Kerem Y. Camsari
Title: Stochastic Sparse Attention for Memory-Bound Inference
Abstract:
Autoregressive decoding becomes bandwidth‑limited at long contexts, as generating each token requires reading all n_k key and value vectors from KV cache. We present Stochastic Additive No‑mulT Attention (SANTA), a method that sparsifies value‑cache access by sampling S \ll n_k indices from the post‑softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post‑softmax value aggregation while replacing value‑stage multiply‑accumulates with gather‑and‑add. We introduce stratified sampling to design variance‑reduced, GPU‑friendly variants, demonstrating 1.5× decode‑step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k‑token contexts. Finally, we propose Bernoulli qK^\mathsfT sampling as a complementary technique to sparsify the score stage, reducing key‑feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low‑rank projections, and KV‑cache compression. Together, they point toward sparse, multiplier‑free, and energy‑efficient inference. We open‑source our kernels at: https://github.com/OPUSLab/SANTA.git

Authors:Hongkun Pan, Yuwei Wu, Wanyi Hong, Shenghui Hu, Qitong Yan, Yi Yang, Rufei Han, Changju Zhou, Minfeng Zhu, Dongming Han, Wei Chen
Title: Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Abstract:
Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine‑grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus‑driven fine‑grained chart reasoning model, Chart‑FR1, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose Focus‑CoT, a visual focusing chain‑of‑thought that enhances fine‑grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce Focus‑GRPO, a focus‑driven reinforcement learning algorithm with an information‑efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth as more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we build HID‑Chart, a challenging benchmark with an information‑density metric designed to evaluate fine‑grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that Chart‑FR1 outperforms state‑of‑the‑art MLLMs in chart understanding and reasoning. Code is available at https://github.com/phkhub/Chart‑FR1.

Authors:Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, Ehsan Adeli
Title: GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models
Abstract:
Brain MRI foundation models learn rich representations of anatomy, but interpreting what clinical information they encode remains an open problem. Standard sparse autoencoders (SAEs) suffer from severe feature collapse in deep transformer layers, and in Alzheimer's disease (AD) research, aging confounds nearly every clinical variable, making naive annotation unreliable. We propose GeoSAE, a geometry‑guided SAE framework that uses the foundation model's learned manifold structure to prevent feature collapse and annotates each surviving feature via age‑deconfounded partial correlations. Applied to ~14k T1‑weighted MRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Australian Imaging biomarkers and Lifestyle (AIBL) datasets, GeoSAE identifies a compact, fully interpretable feature set that predicts mild cognitive impairment (MCI)‑to‑AD conversion (AUC 0.746) using only 2% of the embedding dimensions, while comorbidity‑annotated features achieve only chance‑level performance. The identified features replicate across cohorts without retraining (r=0.97) and localize to neuroanatomically distinct regions consistent with Braak staging. This shows that geometry‑guided SAEs can extract interpretable, biomarkers from frozen brain MRI foundation models.

Authors:Kwan Soo Shin
Title: The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don't
Abstract:
An auditor instructs an AI assistant: "open each file individually using the Read tool ‑‑ no scripts, no agents." The AI replies "Yes" ‑‑ then issues a single batched call summarizing all fifty files at once. We call this the Compliance Gap: a third, orthogonal axis of AI honesty distinct from factual truthfulness and rhetorical substance. Three questions: does this verbal‑behavioral disconnect exist (existence); can any text‑only observer recover it (detectability); what infrastructure does AI deployment need (remedy)? Some 75 benchmarks (IFEval, SWE‑bench, BFCL, COMPASS, SpecEval) measure outcome fidelity; none measures process fidelity. Theorem 1 shows the gap is structurally inevitable under RL that rewards text without observing behavior. Theorem 2, via the Data Processing Inequality, shows it is undetectable from text alone ‑‑ by any human or LLM observer, present or future. Thirteen experiments and 2,031 sessions on six frontier models confirm both predictions. Under default framing, all six exhibit instruction compliance rates of 0% ‑‑ Claude Sonnet 4 verbally agrees ten out of ten times then bypasses in all ten. The gap is selective: 97% compliance where rationale is rewarded (audit trails), 0‑4% where it is not (file reading, privacy masking); removing delegation tools raises compliance to 75% (Cohen's d = 2.47), confirming environmental affordance rather than weight‑encoded failure. Nine blinded human raters achieve Fleiss' kappa = 0.130 and correctly identify zero of fifteen compliant sessions, exactly as Theorem 2 predicts. Where humans show 47% intention‑behavior gaps in psychology and 96.5pp gaps in surgical audits, RLHF‑trained models approach 100% under default conditions ‑‑ a regime warranting its own measurement infrastructure. We release BS‑Bench: the first open benchmark for process compliance, with seven tool‑call‑log audit metrics and a public leaderboard.

Authors:Qiao Liu
Title: Missingness-aware Data Imputation via AI-powered Bayesian Generative Modeling
Abstract:
Missing data imputation remains a fundamental challenge in modern data science, especially when uncertainty quantification is essential. In this work, we propose MissBGM, an AI‑powered missing data imputation method via Bayesian generative modeling that bridges the expressive flexibility of neural networks with the statistical rigor of Bayesian inference. Unlike existing methods that often focus on point estimates or treat the missingness mechanism implicitly, MissBGM explicitly and jointly models the data‑generating and missingness mechanisms, providing principled posterior uncertainty over imputations rather than a single point estimate. We develop a stochastic optimization framework with alternating updates among missing values, model parameters, and latent variables until convergence. Our theoretical analysis shows that estimates of missing values from MissBGM converge consistently under mild assumptions. Empirically, we demonstrate that MissBGM achieves superior performance over traditional imputers and recent neural network‑based methods across extensive experimental settings. These results establish MissBGM as a principled and scalable solution for modern missing data imputation. The code for MissBGM is open sourced at https://github.com/liuq‑lab/MissBGM.

Authors:Sungyoung Lee, Dohyeong Kim, Eshan Balachandar, Zelal Su Mustafaoglu, Keshav Pingali
Title: Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
Abstract:
We propose Flow‑Anchored Noise‑conditioned Q‑Learning (FAN), a highly efficient and high‑performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that utilizes only a single flow policy iteration and requires only a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state‑of‑the‑art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.

Authors:Viet Thanh Duy Nguyen, John K. Johnstone, Truong-Son Hy
Title: PRIME: Protein Representation via Physics-Informed Multiscale Equivariant Hierarchies
Abstract:
Proteins are inherently multiscale physical systems whose functional properties emerge from coordinated structural organization across multiple spatial resolutions, ranging from atomic interactions to global fold topology. However, existing protein representation learning methods typically operate at a single structural level or treat different sources of structural information as parallel modalities, without explicitly modeling their hierarchical relationships. We introduce PRIME (Protein Representation via Physics‑Informed Multiscale Equivariant Hierarchies), a unified framework that models proteins as a nested family of five physically grounded structural graphs spanning surface, atomic, residue, secondary‑structure, and protein levels. Adjacent levels are connected through deterministic, physics‑informed assignment operators, enabling bidirectional information exchange via bottom‑up aggregation and top‑down contextual refinement. Experiments on standard protein representation learning benchmarks demonstrate strong and competitive performance across diverse tasks, with particularly notable gains on the Fold Classification benchmark, where PRIME outperforms the strongest geometric GNN baseline by margins of 13.80 and 18.30 points on the harder Superfamily and Fold splits, and achieves a state‑of‑the‑art accuracy of 84.10% on Reaction Class prediction, surpassing all baseline methods, including ESM. Ablation studies confirm that each structural level contributes complementary and non‑redundant information, and adaptive cross‑attention analysis reveals that PRIME autonomously identifies the most task‑relevant structural resolutions at prediction time. Our source code is publicly available at https://github.com/HySonLab/PRIME

Authors:Paul Garnier, Vincent Lannelongue, Elie Hachem
Title: Mesh Based Simulations with Spatial and Temporal awareness
Abstract:
Machine Learning surrogates for Computational Fluid Dynamics (CFD), particularly Graph Neural Networks (GNNs) and Transformers, have become a new important approach for accelerating physics simulations. However, we identify a critical bottleneck in the field: while architectures have advanced significantly, the common underlying training paradigms remain bound to naive assumptions, such as node‑wise supervision and explicit Euler time‑stepping. These legacy choices ignore the stiff dynamics and local flux continuity inherent to numerous partial differential equations resolution methods, such as Finite Element, Difference, or Volume (FEM). In this work, we propose a unified framework to bridge the gap between geometric deep learning and rigorous numerical analysis. We introduce three key innovations: (1) Multi Node Prediction, a stencil‑level objective that predicts field values for a node's full local topology, enforcing spatial derivative consistency; (2) Temporal Correction, replacing unstable explicit schemes with a predictor‑corrector via temporal Cross‑Attention; and (3) Geometric Inductive Biases, leveraging 3D Rotary Positional Embeddings (RoPE) to robustly capture rotational symmetries in unstructured meshes. We evaluate this framework across three architectures (MeshGraphNet, Transolver, and a Transformer) on diverse physics datasets. Our approach yields consistent improvements in accuracy and stability, particularly in long‑horizon rollouts, while producing latent representations that generalize to unseen subtasks such as Wall Shear Stress or Pressure prediction. Code is available at https://github.com/DonsetPG/graph‑physics.

Authors:Kanak Mazumder, Fabian B. Flohr
Title: LIE: LiDAR-only HD Map Construction with Intensity Enhancement via Online Knowledge Distillation
Abstract:
Online High‑Definition (HD) map construction is a key component of autonomous driving. Recent methods rely on multi‑view camera images for cost‑effective HD map segmentation, but cameras lack depth information for accurate scene geometry. In contrast, LiDAR provides precise 3D measurements but lacks dense semantic cues. In this work, we propose LIE, LiDAR‑only semantic map construction method that employ Knowledge Distillation (KD) to handle the lack of dense semantic and texture cues. Specifically, the teacher branch fuses student LiDAR features and the corresponding 2D intensity map tile to provide dense supervision for segmenting map elements using online distillation scheme. Experimental results show that our method outperforms all single‑modality approaches, achieving 8.2% higher mIoU than the state‑of‑the‑art camera‑based model on nuScenes. LIE is robust over long ranges and under challenging weather and lighting, and efficiently adapts to Argoverse2 with only 10% fine‑tuning, surpassing camera‑based models trained on the full dataset. Source code will be available \hrefhttps://iv.ee.hm.edu/lie/here.

Authors:Zhaoyang Li, Zhichao You, Tianrui Li
Title: SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion
Abstract:
Although multi‑modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection: projecting a sparse point cloud onto the image plane yields an extremely sparse support, which hinders visual prior propagation, a failure mode we term Cross‑Modal Entropy Collapse. To address this practical limitation, we propose SplAttN, which replaces hard projection with Differentiable Gaussian Splatting to produce a dense, continuous image‑plane representation. By reformulating projection as continuous density estimation, SplAttN avoids collapsed sparse support, facilitates gradient flow, and improves cross‑modal connection learnability. Extensive experiments show that SplAttN achieves state‑of‑the‑art performance on PCN and ShapeNet‑55/34. Crucially, we utilize the real‑world KITTI benchmark as a stress test for multi‑modal reliance. Counter‑factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross‑modal connection. Code is available at https://github.com/zay002/SplAttN.

Authors:Jianze Wang, Ying Liu, Jinlong Chen, Xuchun Hu, Qilong Zhang, Yu Cao, Jun Wang, Hua Yang, Yong Xie, Qianglong Chen
Title: MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
Abstract:
On‑policy distillation (OPD) trains a student on its own trajectories under token‑level teacher supervision, but existing methods are capped by a single‑teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per‑step errors compound across long trajectories and destabilize training. We propose MAD‑OPD (Multi‑Agent Debate‑driven On‑Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on‑policy state; the debate produces an emergent collective intelligence that supplies token‑level supervision, with each teacher's contribution weighted by its post‑debate confidence. To extend OPD to agentic tasks, we also introduce On‑Policy Agentic Distillation (OPAD), which adds step‑level sampling to stabilize training under multi‑step error compounding. We additionally derive a task‑adaptive divergence principle, selecting JSD (Jensen‑Shannon divergence) for agentic stability and reverse KL (Kullback‑Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher‑student configurations (Qwen3 and Qwen3.5; 1.7B‑14B students, 8B‑32B teachers) and five agentic and code benchmarks, MAD‑OPD ranks first across all six configurations; on the 14B+8B\to4B setting it lifts the agentic average by +2.4% and the code average by +3.7% over the stronger single‑teacher OPD.

Authors:Xiaorui Wang, Fanda Fan, Chenxi Wang, Yuxuan Yang, Rui Tang, Kuoyu Gao, Simiao Pang, Yuanfeng Shang, Zhipeng Liu, Wanling Gao, Lei Wang, Jianfeng Zhan
Title: CombinationTS: A Modular Framework for Understanding Time-Series Forecasting Models
Abstract:
Recent progress in time‑series forecasting has led to rapidly increasing architectural complexity, yet many reported State‑of‑the‑Art gains are statistically fragile or misattributed. We argue that progress requires a shift from model selection to modular attribution, identifying which components truly drive performance. We propose CombinationTS, a self‑contained probabilistic evaluation framework that decomposes forecasting models into orthogonal modules‑‑Input Transformation, Embedding, Encoder, Decoder, and Output Transformation‑‑and evaluates them under a shared evaluation condition space. By quantifying each component via marginalized performance (μ) and stability (σ), CombinationTS enables robust attribution beyond fragile point estimates. Through large‑scale paired evaluation, we uncover the Identity Paradox: once the data view (Embedding) is well‑designed, a parameter‑free Identity Encoder often matches or outperforms complex backbones. We further show that explicit structural priors introduced via Input Transformations yield a more favorable performance‑stability trade‑off than increasing Encoder complexity, establishing a principled baseline for architectural necessity.

Authors:James Butterworth, Gevik Grigorian, Alejandro DiazDelaO
Title: Deep Variational Inference Symbolic Regression
Abstract:
Symbolic regression discovers explicit, interpretable equations without assuming a functional form in advance. A Bayesian approach strengthens this through probability distributions over candidate expressions, thus quantifying uncertainty in the presence of noisy and limited data. Deep Symbolic Regression (DSR) uses a neural network to generate symbolic expressions, but it is designed to identify a single best‑fitting expression rather than infer a posterior distribution over models. We introduce Deep Variational Inference Symbolic Regression (DVISR), a variational Bayesian extension of DSR. DVISR replaces the original reward with the integrand of the evidence lower bound. It also extends the network architecture to output distributions over constants within expressions, enabling posterior inference over both expression trees and their associated constants. We show that DVISR can recover the true posterior in simple settings, both with and without constant tokens, and we examine how its performance changes as the size of the expression space increases. These results position DVISR as a step toward scalable Bayesian symbolic regression with uncertainty over full symbolic models.

Authors:Hao Zhou, Simon A. Lee, Cyrus Tanade, Keum San Chun, Juhyeon Lee, Migyeong Gwak, Megha Thukral, Justin Sung, Eugene Hwang, Mehrab Bin Morshed, Li Zhu, Viswam Nathan, Md Mahbubur Rahman, Subramaniam Venkatraman, Sharanya Arcot Desai
Title: Physiology-Aware Masked Cross-Modal Reconstruction for Biosignal Representation Learning
Abstract:
Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process. However, most existing self supervised learning methods treat these signals as interchangeable views, overlooking the directional temporal dynamics that link them. A canonical example is the relationship between electrocardiography (ECG), which captures the electrical activation initiating each heartbeat, and photoplethysmography (PPG), which records the resulting peripheral pulse delayed by vascular dynamics. To capture this structured relationship, we introduce xMAE, a biosignal pretraining framework that leverages masked cross modal reconstruction across temporally ordered biosignals as a training time constraint to encourage physiologically meaningful timing structure in the learned representations. We show that pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis suggests that the ECG PPG timing structure is reflected in the learned PPG representations. More broadly, xMAE demonstrates the effectiveness of incorporating temporal structure into multimodal pretraining when signals observe different stages of a shared underlying process. Code is available at https://github.com/hzhou3/xMAE.

Authors:Wei Feng, Haiyong Zheng
Title: Structured Analytic Coherent Point Drift for Non-Rigid Point Set Registration
Abstract:
We introduce Analytic‑CPD, a structured analytic variant of coherent point drift for non‑rigid point set registration. The method retains the CPD posterior correspondence layer, but replaces the point‑indexed Gaussian‑kernel displacement‑field M‑step with a finite‑dimensional structured analytic mapping estimator. Posterior probabilities from the Gaussian mixture model are condensed through a barycentric identity into weighted soft target points, converting the CPD pairwise soft‑correspondence objective into a weighted analytic fitting problem. The deformation is represented by a truncated multivariate Taylor mapping of a vector‑valued function, so the number of deformation parameters is controlled by the ambient dimension and the analytic order rather than by an M‑by‑M kernel system over the moving points. A degree‑continuation strategy is further introduced to stabilize large‑deformation registration by progressively activating higher‑order analytic modes. Experiments on two‑dimensional analytic deformations and three‑dimensional smooth non‑analytic deformations show that Analytic‑CPD achieves lower final errors and faster convergence than standard CPD in representative large‑deformation settings. The results suggest that CPD‑style probabilistic correspondences and structured analytic mappings provide a compact and interpretable alternative to kernel‑based non‑rigid registration. Code is available at https://github.com/monge‑ampere/Analytic‑CPD.

Authors:Hada Melino Muhammad, Zechen Li, Flora Salim, Ahmed A. Metwally
Title: CGM-JEPA: Learning Consistent Continuous Glucose Monitor Representations via Predictive Self-Supervised Pretraining
Abstract:
Continuous Glucose Monitoring (CGM) can detect early metabolic subphenotypes (insulin resistance, IR; β‑cell dysfunction), but population‑scale deployment faces two coupled problems. First, the same physiological state appears through multiple views (CGM time series, venous OGTT, Glucodensity summaries), so single‑view representations fail to transfer when deployment shifts the modality or setting. Second, baselines perform inconsistently across these shifts. Both problems point to one remedy: representations that abstract away from any single view to capture higher‑level temporal and distributional structure. We propose CGM‑JEPA, a self‑supervised pretraining framework which predicts masked latent representations rather than raw values, yielding abstraction that transfers across modalities. X‑CGM‑JEPA adds a masked Glucodensity cross‑view objective for complementary distributional information. We pretrain on ~389k unlabeled CGM readings from 228 subjects and evaluate on two clinical cohorts (N=27 and N=17 public‑release subsets) across three regimes (cohort generalization, venous‑to‑CGM transfer, home CGM) under 20‑iteration × 2‑fold cross‑validation. X‑CGM‑JEPA ranks first or second on AUROC for both endpoints across all three regimes while no baseline does, exceeding the strongest baseline by up to +6.5 pp in cohort generalization and +3.6 pp in venous‑to‑CGM transfer (paired Wilcoxon, p<0.001). Under modality shift, it matches mean AUROC while redistributing toward weaker subgroups (ethnicity AUROC gap shrinks 25‑54%); on sparse in‑domain venous data, the distributional view lifts label‑aware clustering (ARI +39%, NMI +40%). Code and weights: https://github.com/cruiseresearchgroup/CGM‑JEPA

Authors:Hongjun Wang, Po Hu, Kai Han
Title: Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models
Abstract:
Generalized Category Discovery (GCD) aims to categorize unlabelled instances from both known and unknown classes by transferring knowledge from labelled data of known classes. Existing methods assume all data comes from a single domain, yet real‑world unlabelled data often exhibits domain shifts alongside semantic shifts. We study GCD under domain shifts and propose three frameworks that adapt foundation models, ranging from self‑supervised vision models to vision‑language models. (i) HiLo disentangles domain and semantic features through multi‑level feature extraction and mutual information minimization, combined with PatchMix augmentation and curriculum sampling. (ii) HLPrompt extends HiLo with semantic‑aware spatial prompt tuning to suppress background and domain noise. (iii) VLPrompt leverages vision‑language models via factorized textual prompts and cross‑modal consistency regularization. The three methods share core design principles while operating on different foundation backbones, making them suitable for different deployment scenarios. Extensive experiments on synthetic corruptions and real‑world multi‑domain shifts demonstrate consistent improvements over strong baselines. Project page: https://visual‑ai.github.io/hilo/

Authors:Prabhjot Singh, Manmeet Singh
Title: When Less Is More: Simplicity Beats Complexity for Physics-Constrained InSAR Phase Unwrapping
Abstract:
Operational phase unwrapping is the primary computational bottleneck in InSAR‑based volcanic and seismic monitoring. We challenge the industry trend of adopting high‑complexity computer vision architectures, such as attention mechanisms, without validating their suitability for physics‑constrained geophysical regression. We present the first large‑scale architectural ablation study on a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). Our results reveal a significant "complexity penalty": a vanilla U‑Net (7.76M parameters) achieves R^2=0.834 and RMSE = 1.01 cm, outperforming 11.37M‑parameter attention‑based models by 34% in R^2 and 51% in RMSE. Power Spectral Density (PSD) analysis provides the physical justification: while attention excels at capturing sharp semantic edges in natural images, it injects unphysical high‑frequency artifacts (>0.3 cycles/pixel) into geophysical fields, violating the fundamental smoothness constraints of elastic surface deformation. With a 2.92ms inference latency (a 2.5× speedup), the vanilla U‑Net is the only candidate to comfortably meet the sub‑100ms requirement for operational early‑warning systems. This work bridges the "publication‑to‑practice" gap by proving that convolutional locality outperforms modern complexity for smooth‑field regression, advocating for physics‑informed simplicity in ML4RS. Code available at https://github.com/prabhjotschugh/When‑Less‑is‑More‑InSAR‑Phase‑Unwrapping

Authors:Firat Ozdemir, Yun Cheng, Salman Mohebi, Fanny Lehmann, Simon Adamov, Zhenyi Zhang, Leonardo Trentini, Dana Grund, Oliver Fuhrer, Torsten Hoefler, Siddhartha Mishra, Sebastian Schemm, Benedikt Soja, Mathieu Salzmann
Title: Earth System Foundation Model (ESFM): A unified framework for heterogeneous data integration and forecasting
Abstract:
Foundation models (FMs) for the Earth system learn statistical relationships between physical variables across massive datasets to enable versatile downstream applications through finetuning, separating them from task‑specific weather models. Here, we introduce Earth System Foundation Model (ESFM), a fully open model building on the 3D Swin UNet backbone of the pioneering Aurora model. ESFM introduces extensions that increase functionality and foster adoption in climate sciences. First, the encoding scheme and training protocols have been extended to handle diverse datasets, including those containing missing values across all spatio‑temporal dimensions such as satellite data, as well as station data, all under one backbone. Axial attention is introduced to capture inter‑variable dependencies. As a result ESFM skillfully predicts variables in regions or on pressure levels where no data is present at the initial time, while preserving inter‑variable relationships, for example between temperature, pressure, and humidity. Individual variable tokenization enables different sets of variables to be shuffled during training and simplifies the process of building extensions for new downstream tasks. Adaptive layer norm‑based ensembles allow for a simple yet effective way to transform deterministic ESFM to a probabilistic FM. We present findings using dense gridded data (ERA5, CMIP6), regionally masked dense data, sparse gridded MODIS satellite data, and station data. Results demonstrate competitive or superior performance relative to state‑of‑the‑art benchmarks. Case studies of Super Typhoon Doksuri (2023) and 2024 sudden stratospheric warming events show accurate positional and magnitude estimations of extreme weather. ESFM retains the strengths of previous foundation models, such as long‑term stability, but facilitates application to a variety of downstream tasks.

Authors:Hao Xiao
Title: Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions
Abstract:
Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log‑domain Sinkhorn algorithm that combines warp‑level shuffle reductions with shared‑memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log‑domain, enabling robust computation for regularization parameters as small as epsilon = 10^‑4 where standard‑domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves 12x speedup over the widely‑used POT library and 5.9x speedup over GPU‑accelerated PyTorch baselines, while consuming only 256 MB of GPU memory. We validate our solver on image color transfer, 3D point cloud matching, and convergence analysis, demonstrating that native CUDA kernels with careful numerical treatment provide a practical and efficient foundation for large‑scale optimal transport computation.

Authors:Hao Xiao
Title: Sparse Regression under Correlation and Weak Signals: A Reproducible Benchmark of Classical and Bayesian Methods
Abstract:
Choosing between classical and Bayesian sparse regression methods involves a real trade‑off: penalized estimators like Lasso run in milliseconds but give no uncertainty estimates,while Horseshoe and Spike‑and‑Slab priors produce full posteriors but need MCMC chains that take minutes per fit.Surprisingly few studies compare these two families head‑to‑head under the conditions that actually make sparse regression hard ‑‑ correlated features, weak signals, and growing dimensionality. We benchmark six methods (OLS, Ridge,Lasso, Elastic Net, Horseshoe, Spike‑and‑Slab) on synthetic data with three covariance structures (rho up to 0.9), four SNR levels, and p in 20, 50, 100, plus the Diabetes dataset,totalling over 2,600 experiments. The results are clear on some points and nuanced on others. Bayesian methods win on prediction error (MSE 72 vs. 108‑267), and the Horseshoe delivers near‑nominal 95% coverage (94.8%). But Spike‑and‑Slab,despite narrower intervals, under‑covers at 91.9% ‑‑ its continuous relaxation likely plays a role. For variable selection, Lasso and Spike‑and‑Slab tie at F1 ~ 0.47, making Lasso the practical default when posteriors are not needed. Code and data are available at https://github.com/xiao98/sparse‑bayesian‑regression‑bench.

Authors:Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli
Title: BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis
Abstract:
Automatic generation of executable Blender code from natural language remains challenging, with state‑of‑the‑art LLMs producing frequent syntactic errors and geometrically inconsistent objects. We present BlenderRAG, a retrieval‑augmented generation system that operates on a curated multimodal dataset of 500 expert‑validated examples (text, code, image) across 50 object categories. By retrieving semantically similar examples during generation, BlenderRAG improves compilation success rates from 40.8% to 70.0% and semantic normalized alignment from 0.41 to 0.77 (CLIP similarity) across four state‑of‑the‑art LLMs, without requiring fine‑tuning or specialized hardware, making it immediately accessible for deployment. The dataset and code will be available at https://github.com/MaxRondelli/BlenderRAG.

Authors:Chaohao Yuan, Chenghao Xiao, Yu Rong, Hong Cheng, Long-Kai Huang
Title: Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
Abstract:
SFT and RLVR represent two fundamental yet distinct paradigms for LLM post‑training, each excelling in distinct dimensions. SFT expands knowledge breadth while RLVR enhances reasoning depth. Yet integrating these complementary strengths remains a formidable challenge. Sequential training can cause catastrophic forgetting, and joint optimization often suffers from severe gradient conflicts. We analyze SFT and RLVR through the lens of task vectors and reveal three structural properties behind these failures: a 30 magnitude disparity, 45 sign interference, and heterogeneous module‑wise update distributions. These findings show SFT and RLVR are difficult to integrate directly, but they also suggest that the two paradigms modify partly complementary components of the model. Motivated by these observations, we propose Decoupled Test‑time Synthesis (DoTS), a post‑hoc framework allows SFT and RLVR checkpoints to be trained independently and synthesizes their capabilities only at inference time via task vector arithmetic, without updating model parameters. To reduce interference, DOTS applies selective sparsification with norm‑preserving rescaling. It then uses Bayesian optimization on a small set of unlabeled queries to search for combination coefficients on the Pareto frontier of consistency and perplexity. Empirically, \ours matches or exceeds the performance of training‑based SFT‑‑RLVR integration methods across multiple mathematical reasoning benchmarks, incurring only ~3% of the computational cost. When applied to stronger post‑trained checkpoints, DOTS surpasses SOTA models and generalizes to out‑of‑domain benchmarks without re‑tuning. Code is available at https://github.com/chaohaoyuan/DoTS.

Authors:Man Yung Wong
Title: Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
Abstract:
Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/‑ 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/‑ 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per‑expert LIF membrane potential accumulating routing context across tokens; precision‑weighted gating (Pi), a per‑expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing, a next‑state predictor conditioned on the beta‑accumulated hidden state. The mechanisms draw from Friston's Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super‑additive beta x Ant interaction: anticipation alone gives nothing (+0.000 +/‑ 0.001); beta alone gives modest gain (+0.295 +/‑ 0.013); combined they close 75% of the oracle gap (+0.741 +/‑ 0.002, exceeding the sum by +0.446 +/‑ 0.014). This is structural: a stateless predictor cannot detect approaching transitions because pre‑transition tokens are distributionally identical to within‑domain tokens. In a character‑level MoE LM (5 seeds), beta‑routing reduces transition‑step BPC from 6.56 +/‑ 0.01 (Standard) to 4.01 +/‑ 0.15 (beta‑MoE); the beta + Ant gate places 0.86 +/‑ 0.02 probability on the correct domain expert before that domain appears in input, vs 0.42 +/‑ 0.12 for Standard MoE. Reference implementations (~200 lines each): https://github.com/russellwmy/affinity‑is‑not‑enough

Authors:Yao Ni, Jeremie Houssineau, Yew Soon Ong, Piotr Koniusz
Title: Possibilistic Predictive Uncertainty for Deep Learning
Abstract:
Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modelling. Existing methods for uncertainty modelling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second‑order predictors lack rigorous derivations connecting their specific objectives to epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet‑approximated possibilistic posterior predictions (DAPPr), a principled framework leveraging possibility theory. We define a possibilistic posterior over parameters, projects this posterior to the prediction space via supremum operators, and approximates the projected posterior using learnable Dirichlet possibility functions. This projection‑and‑approximation strategy yields a simple training objective with closed‑form solutions. Extensive experiments across diverse benchmarks demonstrate that our approach achieves competitive or superior uncertainty quantification performance compared to state‑of‑the‑art evidential deep learning methods while maintaining both principled derivation and computational efficiency. Code will be available at https://github.com/MaxwellYaoNi/DAPPr.

Authors:Ziwen Zhao, Menglin Yang
Title: Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation
Abstract:
Retrieval‑augmented generation (RAG) enhances large language models with external knowledge, and tree‑based RAG organizes documents into hierarchical indexes to support queries at multiple granularities. However, existing Tree‑RAG methods designed for single‑document retrieval face critical challenges in scaling to cross‑document multi‑hop questions: (1) poor distribution adaptability, where k‑means clustering introduces noise due to rigid distribution assumptions; (2) structural isolation, as tree indexes lack explicit cross‑document connections; and (3) coarse abstraction, which obscures fine‑grained details. To address these limitations, we propose Ψ‑RAG, a tree‑RAG framework with two key components. First, a hierarchical abstract tree index built through an iterative "merging and collapse" process that adapts to data distributions without a priori assumption. Second, a multi‑granular retrieval agent that intelligently interacts with the knowledge base with reorganized queries and an agent‑powered hybrid retriever. Ψ‑RAG supports diverse tasks from token‑level question answering to document‑level summarization. On cross‑document multi‑hop QA benchmarks, it outperforms RAPTOR by 25.9% and HippoRAG 2 by 7.4% in average F1 score. Code is available at https://github.com/Newiz430/Psi‑RAG.

Authors:Weifei Jin, Xilong Wang, Wei Zou, Jinyuan Jia, Neil Gong
Title: CleanBase: Detecting Malicious Documents in RAG Knowledge Databases
Abstract:
Retrieval‑augmented generation (RAG) is vulnerable to prompt injection attacks, in which an adversary inserts malicious documents containing carefully crafted injected prompts into the knowledge database. When a user issues a question targeted by the attack, the RAG system may retrieve these malicious documents, whose injected prompts mislead it into generating attacker‑specified answers, thereby compromising the integrity of the RAG system. In this work, we propose CleanBase, a method to detect malicious documents within a knowledge database. Our key insight is that malicious documents crafted for the same attack‑targeted questions often exhibit high semantic similarity, as attackers deliberately make them consistent to improve attack success rates. Accordingly, CleanBase constructs a similarity graph over the knowledge database, where each node represents a document and an edge connects two nodes if their semantic similarity‑‑computed using an embedding model‑‑exceeds a statistically determined threshold. Due to their inherent similarity, malicious documents tend to form cliques within this graph. CleanBase detects such cliques and flags the corresponding documents as malicious. We theoretically derive upper bounds on CleanBase's false positive and false negative rates and empirically validate its effectiveness. Experimental results across multiple datasets and prompt injection attacks demonstrate that CleanBase accurately detects malicious documents and effectively safeguards RAG systems. Our source code is available at https://github.com/WeifeiJin/CleanBase.

Authors:Maksym Nechepurenko, Pavel Shuvalov
Title: Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents
Abstract:
Evaluating the true forecasting ability of AI agents requires environments resistant to overfitting, free from centralized trust, and grounded in incentive‑compatible scoring. Existing benchmarks either rely on static datasets vulnerable to training‑data contamination, or measure trading PnL ‑‑ a metric conflating predictive accuracy with timing, sizing, and risk appetite. We introduce Foresight Arena, the first permissionless, on‑chain benchmark for evaluating AI forecasting agents on real‑world prediction markets. Agents submit probabilistic forecasts on binary Polymarket markets via a commit‑reveal protocol enforced by Solidity smart contracts on Polygon PoS; outcomes are resolved trustlessly through the Gnosis Conditional Token Framework. Performance is measured by the Brier Score and a novel Alpha Score ‑‑ proper scoring rules that incentivize honest probability reporting and isolate predictive edge over market consensus. We provide a formal analysis: closed‑form variance for per‑market Alpha, the connection to Murphy's classical Brier decomposition, and a power analysis characterizing the number of rounds required to reliably distinguish agents of different skill levels. We show that detecting a true edge of α^ = 0.02 at 80% power requires approximately 350 resolved binary predictions (50 rounds of 7 markets), while α^ = 0.01 requires four times more. We complement these analytical results with a 50‑round live evaluation of five frontier LLM agents plus a random baseline. Murphy decomposition distinguishes well‑calibrated agents from market‑tracking agents that fail through reduced resolution. All smart contracts and evaluation infrastructure are open‑source.

Authors:Jiale Fu, Yuchu Jiang, Peijun Wu, Chonghan Liu, Joey Tianyi Zhou, Xu Yang
Title: Rethinking LLM Ensembling from the Perspective of Mixture Models
Abstract:
Model ensembling is a well‑established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying conventional ensemble implementation to LLMs, which require a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture‑model‑like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x‑2.68x faster than conventional ensemble. Furthermore, this perspective connects LLM ensembling and token‑level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token‑level routing strategies for LLMs. Our code is available at https://github.com/jialefu/Mixture‑model‑like‑Ensemble/.

Authors:Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Li Wang, Xiaodong Lu, Wei Lin, Ran He, Guojun Yin
Title: ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over‑incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative‑positive head‑gradient interference and derive a single‑forward proxy that upper‑bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative‑token hidden representations onto an SVD‑based low‑rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.

Authors:Anamika Lochab, Bolian Li, Ruqi Zhang
Title: Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single‑attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi‑sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self‑reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy‑regularized optimality, which identify the Uniform‑Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform‑Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B‑7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation‑level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.

Authors:Yuhui Lu, Wenjing Liu, Kun Zhan
Title: Information-geometric adaptive sampling for graph diffusion
Abstract:
Standard diffusion models for graph generation typically rely on uniform time‑stepping, an approach that overlooks the non‑homogeneous dynamics of distributional evolution on complex manifolds. In this paper, we present an information‑geometric framework that reinterprets the diffusion sampling trajectory as a parametric curve on a Riemannian manifold. Our key observation is that the Fisher‑Rao metric provides a principled measure of the intrinsic distance. By analyzing this metric, we derive the Drift Variation Score (DVS), a geometry‑aware indicator that quantifies the instantaneous rate of distributional change. Unlike prior heuristic‑based adaptive samplers, our DVS solver enforces a constant informational speed on the statistical manifold, automatically maintaining a uniform rate of distributional change along the sampling trajectory. This equal arc‑length strategy ensures that each discretization step contributes equally to the information speed. Theoretical analysis verifies that DVS characterizes the local stiffness of the sampling dynamics in the Fisher‑Rao sense. Experimental results on molecule and social network generation show that DVS significantly improves structural fidelity and sampling efficiency. Code is at https://github.com/kunzhan/DVS

Authors:Jesse Schneider, William J. Welch
Title: Bayesian Optimization in Linear Time
Abstract:
Bayesian optimization is a sequential method for minimizing objective functions that are expensive to evaluate and about which few assumptions can be made. By using all gathered data to train a Gaussian process model for the function and adaptively employing a mixture of global exploration and local exploitation, this method has been used for optimization in many fields including machine learning, automotive engineering and reinforcement learning. However, the standard method suffers from two problems: 1) with cubic computational complexity in the training‑set size it eventually becomes computationally infeasible to train the model, and 2) globally modeling the objective function is not necessarily optimal given the local nature of minimization. Using flexible and recursive binary partitioning of the search space, we adapt both the modeling and acquisitive aspects of standard Bayesian optimization to work harmoniously with the partitioning scheme, thereby ameliorating both standard shortcomings. We compare our method against a commonly used Bayesian optimization library on seven challenging test functions, ranging in dimensionality from 6 to 124, and show that our method achieves superior optimization performance in all tests. In addition our method has linear computational complexity.

Authors:Eichi Uehara
Title: SHIFT: Robust Double Machine Learning for Average Dose-Response Functions under Heavy-Tailed Contamination
Abstract:
Double‑machine‑learning pipelines for the Average Dose‑Response Function rely on kernel‑weighted local‑linear smoothers, which inherit unbounded functional influence: a single outlier within a kernel window biases the curve across the entire window. We introduce SHIFT (Self‑calibrated Heavy‑tail Inlier‑Fit with Tempering), a robust DML estimator combining cross‑fit nuisance orthogonalization with a kernel‑local Welsch‑loss second stage optimized by Graduated Non‑Convexity, and ‑‑ the principal design choice ‑‑ a defensive OLS refit whose inlier cutoff is scaled by post‑GNC residual MAD rather than the raw‑outcome MAD. On a localized‑contamination stress test at p=0.25 this design choice drops level‑RMSE from 1.03 to 0.33 while leaving clean and uniformly‑contaminated runs unchanged. Across 1,400 main‑sweep fits, SHIFT has competitive worst‑case shape recovery (RMSE 0.325 at p=0.25, second to Huber‑DML's 0.276); among the three methods with worst‑case RMSE below 0.35, only SHIFT emits a non‑uniform per‑sample weight vector, recovering the ground‑truth outlier mask at mean F_1 \approx 0.96 (range 0.945‑‑0.968) on Gaussian‑jump DGPs. We pair the estimator with a six‑technique Extreme Value Theory diagnostic suite (Hill, GPD‑MLE/PWM, GEV, Mean Excess, parameter stability, causal tail coefficient) that lets a practitioner distinguish Frechet from Weibull regimes and choose between SHIFT and L1 alternatives on empirical grounds. Extensions to binary‑treatment CATE (Huber pseudo‑outcome X‑Learner) and time‑series ADRF (block‑CV + rolling MAD) are included. A counter‑intuitive ablation: linear nuisance models (Ridge, Lasso) outperform gradient‑boosted nuisances for robust DML under uniform contamination, inverting the usual more‑flexible‑is‑better heuristic.

Authors:YiFeng Wang, Zhun Sun, Keisuke Sakaguchi
Title: Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
Abstract:
We present Activation Residual Hessian Quantization (ARHQ), a post‑training weight splitting method designed to mitigate error propagation in low‑bit activation‑weight quantization. By constructing an input‑side residual Hessian from activation quantization residuals (G_x), ARHQ analytically identifies and isolates error‑sensitive weight directions into a high‑precision low‑rank branch. This is achieved via a closed‑form truncated SVD on the scaled weight matrix W G^1/2_x . Experimental results on Qwen3‑4B‑Thinking‑2507 demonstrate that ARHQ significantly improves layer‑wise SNR and preserves downstream reasoning performance on ZebraLogic even under aggressive quantization. The code is available at https://github.com/BeautMoonQ/ARHQ.

Authors:Binghao Huang, Yunzhu Li
Title: FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems
Abstract:
We present FlexiTac, a low‑cost, open‑source, and scalable piezoresistive tactile sensing solution designed for robotic end‑effectors. FlexiTac is a practical "plug‑in" module consisting of (i) thin, flexible tactile sensor pads that provide dense tactile signals and (ii) a compact multi‑channel readout board that streams synchronized measurements for real‑time control and large‑scale data collection. FlexiTac pads adopt a sealed three‑layer laminate stack (FPC‑Velostat‑FPC) with electrode patterns directly integrated into flexible printed circuits, substantially improving fabrication throughput and repeatability while maintaining mechanical compliance for deployment on both rigid and soft grippers. The readout electronics use widely available, low‑cost components and stream tactile signals to a host computer at 100 Hz via serial communication. Across multiple configurations, including fingertip pads and larger tactile mats, FlexiTac can be mounted on diverse platforms without major mechanical redesign. We further show that FlexiTac supports modern tactile learning pipelines, including 3D visuo‑tactile fusion for contact‑aware decision making, cross‑embodiment skill transfer, and real‑to‑sim‑to‑real fine‑tuning with GPU‑parallel tactile simulation. Our project page is available at https://flexitac.github.io/.

Authors:Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma, Mohammad Masudur Rahman
Title: DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
Abstract:
Transformer models are widely deployed in critical AI applications, yet faults in their attention mechanisms, projections, and other internal components often degrade behavior silently without raising runtime errors. Existing fault diagnosis techniques often target generic deep neural networks and cannot identify which transformer component is responsible for an observed symptom. In this article, we present DEFault++, a hierarchical learning‑based diagnostic technique that operates at three level of abstraction: it detects whether a fault is present, classifies it into one of 12 transformer‑specific fault categories (covering both attention‑internal mechanisms and surrounding architectural components), and identifies the underlying root cause from up to 45 mechanisms. To facilitate both training and evaluation, we construct DEFault‑bench, a benchmark of 3,739 labeled instances obtained through systematic mutation testing. These instances are created across seven transformer models and nine downstream tasks using DEForm, a transformer‑specific mutation technique we developed for this purpose. DEFault++ measures runtime behavior at the level of individual transformer components. It organizes these measurements through a Fault Propagation Graph (FPG) derived from the transformer architecture. It then produces an interpretable diagnosis using prototype matching combined with supervised contrastive learning. On DEFault‑bench, DEFault++ exceeds an AUROC of 0.96 for detection and a Macro‑F1 of 0.85 for both categorization and root‑cause diagnosis on encoder and decoder architectures. In a developer study with 21 practitioners, the accuracy of choosing correct repair actions increased from 57.1% without support to 83.3% when using DEFault++.

Authors:Arthur Corrêa, Paulo Nascimento, Samuel Moniz
Title: FiLMMeD: Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing
Abstract:
Solving practical multi‑depot vehicle routing problems (MDVRP) is a challenging optimization task central to modern logistics, increasingly driven by e‑commerce. To address the MDVRP's computational complexity, neural‑based combinatorial optimization methods offer a promising scalable alternative to traditional approaches. However, neural‑based methods typically rely on rigid architectures and input encodings tailored to specific problem formulations. In real‑world settings, heterogeneous constraints create multiple MDVRP variants, limiting the applicability of such models. While multi‑task learning (MTL) has begun to accelerate the development of unified neural‑based solvers, prior works focus almost exclusively on single‑depot VRPs, leaving the MDVRP unaddressed. To bridge this gap, we propose Feature‑wise Linear Modulation for Cross‑Problem Multi‑Depot Vehicle Routing (FiLMMeD), a novel unified neural‑based model for 24 different MDVRP variants. We introduce three main contributions: (1) to improve the model's generalization, we augment the standard Transformer encoder with Feature‑wise Linear Modulation (FiLM), which dynamically conditions learned internal representations based on the active set of constraints; (2) we provide an initial demonstration of Preference Optimization in the MTL setting, establishing it as a superior alternative to Reinforcement Learning for future MTL works; (3) to mitigate the generalization gap caused by the introduction of multi‑depot constraints, we introduce a targeted curriculum learning strategy that progressively exposes the model to increasingly more complex constraint interactions. Extensive experiments on 24 MDVRP variants (including 8 novel formulations) and 16 single‑depot VRPs confirm the effectiveness of FiLMMeD, which consistently outperforms state‑of‑the‑art baselines. Our code is available at: https://github.com/AJ‑Correa/FiLMMeD/tree/main

Authors:Hanane Nour Moussa, Yifei Li, Zhuoyang Li, Yankai Yang, Cheng Tang, Tianshu Zhang, Nesreen K. Ahmed, Ali Payani, Ziru Chen, Huan Sun
Title: D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
Abstract:
Despite recent progress in language models and agents for scientific data‑driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real‑world scientific tasks. To fill this gap, we introduce D3‑Gym, the first automatically constructed dataset with verifiable environments for scientific Data‑Driven Discovery. D3‑Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre‑installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3‑Gym confirms that our evaluation scripts achieve 87.5% agreement with human‑annotated gold standards and strong alignment in domain‑specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3‑Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3‑32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3‑Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU‑NLP‑Group/D3‑Gym.

Authors:Ishrak Hamim Mahi, Siam Ferdous, Md Sakib Sadman Badhon, Nabid Hasan Omi, Md Habibun Nabi Hemel, Farig Yousuf Sadeque, Md. Tanzim Reza
Title: Machine Unlearning for Class Removal through SISA-based Deep Neural Network Architectures
Abstract:
The rapid proliferation of image generation models and other artificial intelligence (AI) systems has intensified concerns regarding data privacy and user consent. As the availability of public datasets declines, major technology companies increasingly rely on proprietary or private user data for model training, raising ethical and legal challenges when users request the deletion of their data after it has influenced a trained model. Machine unlearning seeks to address this issue by enabling the removal of specific data from models without complete retraining. This study investigates a modified SISA (Sharded, Isolated, Sliced, and Aggregated) framework designed to achieve class‑level unlearning in Convolutional Neural Network (CNN) architectures. The proposed framework incorporates a reinforced replay mechanism and a gating network to enhance selective forgetting efficiency. Experimental evaluations across multiple image datasets and CNN configurations demonstrate that the modified SISA approach enables effective class unlearning while preserving model performance and reducing retraining overhead. The findings highlight the potential of SISA‑based unlearning for deployment in privacy‑sensitive AI applications. The implementation is publicly available at https://github.com/SiamFS/ sisa‑class‑unlearning.

Authors:Al Zadid Sultan Bin Habib, Tanpia Tasnim, Md. Ekramul Islam, Muntasir Tabasum
Title: ZAYAN: Disentangled Contrastive Transformer for Tabular Remote Sensing Data
Abstract:
Learning informative representations from tabular data in remote sensing and environmental science is challenging due to heterogeneity, scarce labels, and redundancy among features. We present ZAYAN (Zero‑Anchor dYnamic feAture eNcoding), a self‑supervised, feature‑centric contrastive framework for tabular data. ZAYAN performs contrastive learning at the feature rather than sample level, removing the need for explicit anchor selection and any reliance on class labels, while encouraging a redundancy‑minimized, disentangled embedding space. The framework has two modules: ZAYAN‑CL, which pretrains feature embeddings via a zero‑anchor contrastive objective with dynamic perturbations and masking, and ZAYAN‑T, a Transformer that conditions on these embeddings for downstream classification. Across eight datasets, including six remote‑sensing tabular benchmarks and two remote‑sensing‑driven flood‑prediction tables from satellite and GIS products, ZAYAN achieves superior accuracy, robustness, and generalization over tabular deep learning baselines, with consistent gains under label scarcity and distribution shift. These results indicate that feature‑level contrastive learning and dynamic feature encoding provide an effective recipe for learning from tabular sensing data.

Authors:Ethan Bito, Yongli Ren, Estrid He
Title: One Pass, Any Order: Position-Invariant Listwise Reranking for LLM-Based Recommendation
Abstract:
Large language models (LLMs) are increasingly used for recommendation reranking, but their listwise predictions can depend on the order in which candidates are presented. This creates a mismatch between the set‑based nature of recommendation and the sequence‑based computation of decoder‑only LLMs, where permuting an otherwise identical candidate set can change item scores and final rankings. Such order sensitivity makes LLM‑based rerankers difficult to rely on, since rankings may reflect prompt serialization rather than user preference. We propose InvariRank, a permutation‑invariant listwise reranking framework that addresses this dependence at the architectural level. InvariRank blocks cross‑candidate attention with a structured attention mask and negates position‑induced scoring changes through shared positional framing under Rotary Positional Embeddings (RoPE). Combined with a listwise learning‑to‑rank objective, the model scores all candidates in a single forward pass, avoiding permutation‑based invariance training objectives that require multiple permutations of a candidate set. Experiments on recommendation benchmarks show that InvariRank maintains competitive ranking effectiveness while producing stable rankings across candidate permutations. The results suggest that architectural invariance is a practical route to reliable and efficient LLM‑based recommendation reranking. The source code is at https://github.com/ejbito/InvariRank.

Authors:Eichi Uehara
Title: Bayesian X-Learner: Calibrated Posterior Inference for Heterogeneous Treatment Effects under Heavy-Tailed Outcomes
Abstract:
Conditional Average Treatment Effect (CATE) estimation in practice demands three properties simultaneously: heterogeneous effects τ(x), calibrated uncertainty over them, and robustness to the heavy tails that contaminate real outcome data. Meta‑learners (Künzel et al., 2019) give (i); causal forests and BART give (i)‑(ii) with Gaussian‑tail assumptions; no widely used tool gives all three. We present Bayesian X‑Learner, an X‑Learner built on cross‑fitted doubly robust pseudo‑outcomes (Kennedy, 2020) with a full MCMC posterior over τ(x) via a Welsch redescending pseudo‑likelihood. On Hill's IHDP benchmark the default configuration attains mean \sqrt\varepsilon_\mathrmPEHE = 0.56 on 5 replications (lowest mean; differences from S‑/T‑/X‑learners, full‑config Causal BART, and a causal forest baseline are not significant at α=0.05, and rank ordering is unstable at 10 replications ‑‑ IHDP comparisons are competitive rather than dominant). On contaminated "whale" DGPs with up to 20‑25% tail density, a one‑flag extension (contamination_severity) that selects a Huber‑δ nuisance loss per Huber's minimax‑δ relation recovers RMSE \approx 0.13 with tight credible intervals (single‑cross‑fit 30‑seed coverage 83% [Wilson 66%, 93%] at 20% density; modular‑Bayes pooling with Bayesian‑bootstrap nuisance draws restores nominal 95% coverage).

Authors:Vijay Sadashivaiah, Georgios Dasoulas, Judith Mueller, Soumya Ghosh
Title: Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
Abstract:
Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop in replacement for softmax attention a) produces better learned representations: on six diverse single‑cell datasets, sigmoid achieves 25% higher cell‑type separation, better cell‑type cohesion metrics, and lower validation loss, b) faster training, models with sigmoid attention train up to 10% faster than their softmax counterparts, and c) more stable training by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives (\leq 0.25) as opposed to softmax, and a diagonal Jacobian structure in contrast with softmax's dense coupling, which together help alleviate training instabilities. In stress tests on 160M‑parameter bidirectional attention models trained without gradient clipping on 8K‑token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open‑source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention‑2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at https://github.com/MSDLLCpapers/triton‑sigmoid

Authors:Yibin Luo, Shiwei Gao, Huichuan Zheng, Youyou Lu, Jiwu Shu
Title: Efficient Training on Multiple Consumer GPUs with RoundPipe
Abstract:
Fine‑tuning Large Language Models (LLMs) on consumer‑grade GPUs is highly cost‑effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., the LM head is large) to GPUs limits the pipeline's throughput to that of the GPU with the heaviest load, leading to severe pipeline bubbles. In this paper, we propose RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round‑robin manner, achieving a near‑zero‑bubble pipeline. To ensure training correctness and system efficiency, RoundPipe integrates a priority‑aware transfer scheduling engine, a fine‑grained distributed event‑based synchronization protocol, and an automated layer partitioning algorithm. Evaluations on an 8× RTX 4090 server demonstrate that RoundPipe achieves 1.48‑‑2.16× speedups over state‑of‑the‑art baselines when fine‑tuning 1.7B to 32B models. Remarkably, RoundPipe enables LoRA fine‑tuning of the Qwen3‑235B model with 31K sequence length on a single server. RoundPipe is publicly available as an open‑source Python library with comprehensive documentation.

Authors:Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan
Title: Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Abstract:
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state‑of‑the‑art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross‑architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross‑architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise‑dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross‑tokenizer objective that inverts chunk‑level likelihood matching, yielding bounded gradients and dual‑end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.

Authors:Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao
Title: ClawGym: A Scalable Framework for Building Effective Claw Agents
Abstract:
Claw‑style environments support multi‑step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw‑style personal agent development. Concretely, we construct ClawGym‑SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona‑driven intents and skill‑grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw‑style models, termed ClawGym‑Agents, through supervised fine‑tuning on black‑box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per‑task sandboxes.To support reliable evaluation, we further construct ClawGym‑Bench, a benchmark of 200 instances calibrated through automated filtering and human‑LLM review. Relevant resources will be soon released at https://github.com/ClawGym.

Authors:Albert Saiapin, Kim Batselier
Title: Laplace Approximation for Bayesian Tensor Network Kernel Machines
Abstract:
Uncertainty estimation is essential for robust decision‑making in the presence of ambiguous or out‑of‑distribution inputs. Gaussian Processes (GPs) are classical kernel‑based models that offer principled uncertainty quantification and perform well on small‑ to medium‑scale datasets. Alternatively, formulating the weight space learning problem under tensor network assumptions yields scalable tensor network kernel machines. However, these assumptions break Gaussianity, complicating standard probabilistic inference. This raises a fundamental question: how can tensor network kernel machines provide principled uncertainty estimates? We propose a novel Bayesian Tensor Network Kernel Machine (LA‑TNKM) that employs a (linearized) Laplace approximation for Bayesian inference. A comprehensive set of numerical experiments shows that the proposed method consistently matches or surpasses Gaussian Processes and Bayesian Neural Networks (BNNs) across diverse UCI regression benchmarks, highlighting both its effectiveness and practical relevance.

Authors:Pedro R. Pires, Gregorio F. Azevedo, Rafael T. Sereicikas, Pietro L. Campos, Tiago A. Almeida
Title: The Bandit's Blind Spot: The Critical Role of User State Representation in Recommender Systems
Abstract:
With the increasing availability of online information, recommender systems have become an important tool for many web‑based systems. Due to the continuous aspect of recommendation environments, these systems increasingly rely on contextual multi‑armed bandits (CMAB) to deliver personalized and real‑time suggestions. A critical yet underexplored component in these systems is the representation of user state, which typically encapsulates the user's interaction history and is deeply correlated with the model's decisions and learning. In this paper, we investigate the impact of different embedding‑based state representations derived from matrix factorization models on the performance of traditional CMAB algorithms. Our large‑scale experiments reveal that variations in state representation can lead to improvements greater than those achieved by changing the bandit algorithm itself. Furthermore, no single embedding or aggregation strategy consistently dominates across datasets, underscoring the need for domain‑specific evaluation. These results expose a substantial gap in the literature and emphasize that advancing bandit‑based recommender systems requires a holistic approach that prioritizes embedding quality and state construction alongside algorithmic innovation. The source code for our experiments is publicly available on https://github.com/UFSCar‑LaSID/bandits_blind_spot.

Authors:Seungyub Han, Hyungjin Kim, Jungwoo Lee
Title: Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning
Abstract:
Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self‑Alignment for Safety), a transformer‑based framework that enables test‑time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self‑alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in‑context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov‑guided imagination into control‑invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.

Authors:Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni
Title: Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models
Abstract:
Deploying Vision‑Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource‑constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth‑limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge‑cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed‑size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge‑cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug‑and‑play deployment with off‑the‑shelf VLMs without additional fine‑tuning. This design allows flexible transmission at different information levels, providing a controllable trade‑off between communication cost and semantic fidelity. We implement a full end‑to‑end edge‑cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth‑constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full‑edge and full‑cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open‑ep/ProSemComVLM.

Authors:Zhe Ding, Su Pan, Duowei Pan
Title: CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
Abstract:
Post‑training quantization (PTQ) has become an important technique for reducing the inference cost of Large Language Models (LLMs). While recent mixed‑precision methods improve ultra‑low bit quantization by preserving critical subspaces in high precision, they typically construct these subspaces relying solely on activation statistics. This ignores the fundamental nature of linear operations, where the output perturbation is jointly driven by both activation and weight quantization noise. In this paper, we propose CoQuant, a joint weight‑activation subspace projection method. By theoretically modeling the expected output error, CoQuant formulates a closed‑form weighted PCA solution that balances activation and weight covariances to select the optimal high‑precision subspace. Extensive experiments on Llama‑3.2 and Qwen2.5 models show that CoQuant consistently outperforms strong PTQ baselines in both WikiText perplexity and zero‑shot common‑sense reasoning accuracy. These results demonstrate that joint weight‑activation subspace modeling provides a principled and effective direction for low‑bit LLM quantization. The source code is available at https://github.com/Zachary5895/CoQuant.

Authors:Zhirong Shen, Rui Huang, Jiacheng Liu, Chang Zou, Peiliang Cai, Shikang Zheng, Zhengyi Shi, Liang Feng, Linfeng Zhang
Title: Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
Abstract:
To address the high sampling cost of Diffusion Transformers (DiTs), feature caching offers a training‑free acceleration method. However, existing methods rely on hand‑crafted forecasting formulas that fail under aggressive skipping. We propose L2P (Learnable Linear Predictor), a simple data‑driven caching framework that replaces fixed coefficients with learnable per‑timestep weights. Rapidly trained in ~20 seconds on a single GPU, L2P accurately reconstructs current features from past trajectories. L2P significantly outperforms existing baselines: it achieves a 4.55x FLOPs reduction and 4.15x latency speedup on FLUX.1‑dev, and maintains high visual fidelity under up to 7.18x acceleration on Qwen‑Image models, where prior methods show noticeable quality degradation. Our results show learning linear predictors is highly effective for efficient DiT inference. Code is available at https://github.com/Aredstone/L2P‑Cache.

Authors:Fengchun Zhang, Qiang Ma, Liuyu Xiang, Jinshan Lai, Tingxuan Huang, Jianwei Hu
Title: CO-EVO: Co-evolving Semantic Anchoring and Style Diversification for Federated DG-ReID
Abstract:
Federated domain generalization for person re‑identification (FedDG‑ReID) aims to collaboratively train a pedestrian retrieval model across multiple decentralized source domains such that it can generalize to unseen target environments without compromising raw data privacy. However, this task is significantly challenged by the inherent stylistic gaps across decentralized clients. Without global supervision, models easily succumb to shortcut learning where representations overfit to domain specific camera biases rather than universal identity features. We propose CO‑EVO, a novel federated framework that resolves this semantic‑style conflict through a co‑evolutionary mechanism. On the semantic side, Camera‑Invariant Semantic Anchoring (CSA) learns identity prompts with cross‑camera consistency to establish purified and domain‑agnostic anchors that filter out local imaging noise. On the visual side, Global Style Diversification (GSD), powered by a Global Camera‑Style Bank (GCSB), synthesizes realistic perturbations to expand the visual boundaries of training data. The core of CO‑EVO is its co‑evolutionary loop where purified anchors act as gravitational centers to guide the image encoder toward robust anatomical attributes amidst diverse style variations. Extensive experiments demonstrate that CO‑EVO achieves state‑of‑the‑art (SOTA) performance, proving that the synergy between semantic purification and style expansion is essential for robust cross‑domain generalization. Our code is available at: https://github.com/NanYiyuzurn/ACL‑LGPS‑2026.

Authors:Aditya Ukarande, Deep Shekhar, Marc Blackstein, Ram Rangan
Title: Efficient, VRAM-Constrained xLM Inference on Clients
Abstract:
To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high‑accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelined sharding, a novel, benchmark‑profile‑guided CPU‑GPU hybrid scheduling technique to achieve efficient, VRAM‑constrained inference for both dense and mixture‑of‑experts (MoE) LLMs. Using a combination of model sharding at the sub‑layer level, CPU offloading, pipelined copy‑compute, and prioritized tensor placement in VRAM, it optimizes both time‑to‑first‑token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high‑accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well‑understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products ‑ the In‑Game Inferencing software development kit (IGI SDK) and the Cosmos‑Reason1 (CR1) physical AI reasoning VLM. Highlights from our rigorous evaluation spanning multiple models and client systems include: for interactive use, TTFT improves by up to 6.7x and TPS by up to 30x for LLMs, and CR1 inference's VRAM demand is down by 10x, while in batched mode, throughput improves by up to 8.2x, all compared to their respective aggressive baselines. This paper is accepted at the 9th MLSys Conference (Industry Track), 2026. Code and artifact available at: https://github.com/deepshnv/pipeshard‑mlsys26‑ae

Authors:Wenshuo Zhao, Qi Zhu, Xingshan Zeng, Fei Mi, Lifeng Shang, Yi R., Fung
Title: Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
Abstract:
An effective way to scale up test‑time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high‑entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment‑level uncertainty as the High Entropy Phase (HEP), a variable‑length segment that begins at a high‑entropy token and ends when consecutive low‑entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust‑nlp/entropy‑centroid.

Authors:Mohammed Suhail B Nadaf
Title: reward-lens: A Mechanistic Interpretability Library for Reward Models
Abstract:
Every RLHF‑trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit ‑‑ logit lens, direct logit attribution, activation patching, sparse autoencoders ‑‑ was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward‑lens, an open‑source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector w_r is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three‑mode activation patching, a reward‑hacking probe suite, TopK SAE feature attribution, cross‑model comparison, and five theory‑grounded extensions (distortion index, divergence‑aware patching, misalignment cascade detection, reward‑term conflict analysis, concept‑vector analysis). A ten‑method adapter protocol covers Llama, Mistral, Gemma‑2, and ArmoRM multi‑objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman ρ= ‑0.256 on Skywork, ‑0.027 on ArmoRM). The framework treats this disagreement as a property to expose, not a bug ‑‑ motivating a design that keeps observational and causal views first‑class and directly comparable.

Authors:Emre Ardıç, Yakup Genç
Title: Sample Selection Using Multi-Task Autoencoders in Federated Learning with Non-IID Data
Abstract:
Federated learning is a machine learning paradigm in which multiple devices collaboratively train a model under the supervision of a central server while ensuring data privacy. However, its performance is often hindered by redundant, malicious, or abnormal samples, leading to model degradation and inefficiency. To overcome these issues, we propose novel sample selection methods for image classification, employing a multitask autoencoder to estimate sample contributions through loss and feature analysis. Our approach incorporates unsupervised outlier detection, using one‑class support vector machine (OCSVM), isolation forest (IF), and adaptive loss threshold (AT) methods managed by a central server to filter noisy samples on clients. We also propose a multi‑class deep support vector data description (SVDD) loss controlled by a central server to enhance feature‑based sample selection. We validate our methods on CIFAR10 and MNIST datasets across varying numbers of clients, non‑IID distributions, and noise levels up to 40%. The results show significant accuracy improvements with loss‑based sample selection, achieving gains of up to 7.02% on CIFAR10 with OCSVM and 1.83% on MNIST with AT. Additionally, our federated SVDD loss further improves feature‑based sample selection, yielding accuracy gains of up to 0.99% on CIFAR10 with OCSVM. These results show the effectiveness of our methods in improving model accuracy across various client counts and noise conditions.

Authors:Dominik Żurek, Kamil Faber, Marcin Pietron, Paweł Gajewski, Roberto Corizzo
Title: TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning
Abstract:
Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live environment interactions is expensive, risky, or impossible. However, CORL inherits the dual difficulty of offline reinforcement learning and adapting while preventing catastrophic forgetting. Replay‑based continual learning approaches remain a strong baseline but incur memory overhead and suffer from a distribution mismatch between replayed samples and newly learned policies. At the same time, architectural continual learning methods have shown strong potential in supervised learning but remain underexplored in CORL. In this work, we propose TSN‑Affinity, a novel CORL method based on TinySubNetworks and Decision Transformer. The method enables task‑specific parameterization and controlled knowledge sharing through a RL‑aware reuse strategy that routes tasks according to action compatibility and latent similarity. We evaluate the approach on benchmarks based on Atari games and simulations of manipulation tasks with the Franka Emika Panda robotic arm, covering both discrete and continuous control. Results show strong retention from sparse SubNetworks, with routing further improving multi‑task performance. Our findings suggest that similarity‑guided architectural reuse is a strong and viable alternative to replay‑based strategies in a CORL setting. Our code is available at: https://github.com/anonymized‑for‑submission123/tsn‑affinity.

Authors:Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin
Title: When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
Abstract:
Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.

Authors:Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang, Zizhuo Wang, Dongdong Ge
Title: From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling
Abstract:
Optimization modeling underpins real‑world decision‑making in logistics, manufacturing, energy, and public services, but reliably solving such problems from natural‑language requirements remains challenging for current large language models (LLMs). In this paper, we propose \emphAgora‑Opt, a modular agentic framework for optimization modeling that combines decentralized debate with a read‑write memory bank. Agora‑Opt allows multiple agent teams to independently produce end‑to‑end solutions and reconcile them through an outcome‑grounded debate protocol, while memory stores solver‑verified artifacts and past disagreement resolutions to support training‑free improvement over time. This design is flexible across both backbones and methods: it reduces base‑model lock‑in, transfers across different LLM families, and can be layered onto existing pipelines with minimal coupling. Across public benchmarks, Agora‑Opt achieves the strongest overall performance among all compared methods, outperforming strong zero‑shot LLMs, training‑centric approaches, and prior agentic baselines. Further analyses show robust gains across backbone choices and component variants, and demonstrate that decentralized debate offers a structural advantage over centralized selection by enabling agents to refine candidate solutions through interaction and even recover correct formulations when all initial candidates are flawed. These results suggest that reliable optimization modeling benefits from combining collaborative cross‑checking with reusable experience, and position Agora‑Opt as a practical and extensible foundation for trustworthy optimization modeling assistance. Our code and data are available at https://github.com/CHIANGEL/Agora‑Opt.

Authors:Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn
Title: Barriers to Universal Reasoning With Transformers (And How to Overcome Them)
Abstract:
Chain‑of‑Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that ‑‑ under standard positional encodings and a finite alphabet ‑‑ Transformers with CoT cannot solve problems beyond TC^0, i.e. the expressivity benefits do not hold under the stricter requirement of length‑generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length‑generalizable simulation of Turing machines where the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last‑occurrence retrieval. We assign each tape position a unique signpost token, and log only value changes to enable recovery of the current tape symbol through counts circumventing both barriers. Further, we empirically show that the use of such signpost tokens and value change encodings provide actionable guidance to improve length generalization on hard problems.

Authors:Tri-Nhan Vo, Dang Nguyen, Kien Do, Sunil Gupta
Title: Improving Diversity in Black-box Few-shot Knowledge Distillation
Abstract:
Knowledge distillation (KD) is a well‑known technique to effectively compress a large network (teacher) to a smaller network (student) with little sacrifice in performance. However, most KD methods require a large training set and internal access to the teacher, which are rarely available due to various restrictions. These challenges have originated a more practical setting known as black‑box few‑shot KD, where the student is trained with few images and a black‑box teacher. Recent approaches typically generate additional synthetic images but lack an active strategy to promote their diversity, a crucial factor for student learning. To address these problems, we propose a novel training scheme for generative adversarial networks, where we adaptively select high‑confidence images under the teacher's supervision and introduce them to the adversarial learning on‑the‑fly. Our approach helps expand and improve the diversity of the distillation set, significantly boosting student accuracy. Through extensive experiments, we achieve state‑of‑the‑art results among other few‑shot KD methods on seven image datasets. The code is available at https://github.com/votrinhan88/divbfkd.

Authors:Sehyeon Oh, Yongin Kwon, Jemin Lee
Title: QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention
Abstract:
FlashAttention improves efficiency through tiling, but its online softmax still relies on floating‑point arithmetic for numerical stability, making full quantization difficult. We identify three main obstacles to integer‑only FlashAttention: (1) scale explosion during tile‑wise accumulation, (2) inefficient shift‑based exponential operations on GPUs, and (3) quantization granularity constraints requiring uniform scales for integer comparison. To address these challenges, we propose QFlash, an end‑to‑end integer FlashAttention design that performs softmax entirely in the integer domain and runs as a single Triton kernel. On seven attention workloads from ViT, DeiT, and Swin models, QFlash achieves up to 6.73× speedup over I‑ViT and up to 8.69× speedup on Swin, while reducing energy consumption by 18.8% compared to FP16 FlashAttention, without sacrificing Top‑1 accuracy on ViT/DeiT and remaining competitive on Swin under per‑tensor quantization. Our code is publicly available at https://github.com/EfficientCompLab/qflash.

Authors:Chenbo Yu
Title: DGLight: DQN-Guided GRPO Fine-Tuning of Large Language Models for Traffic Signal Control
Abstract:
Traffic signal control (TSC) plays a central role in reducing congestion and maintaining urban mobility. This dissertation introduces DGLight, a critic‑guided reinforcement‑learning framework for adapting a pretrained large language model to TSC. DGLight first trains a CoLight‑based Deep Q‑Network critic to estimate traffic‑aware action values from structured intersection states, then uses the frozen critic to score candidate language‑model actions and optimize the policy with Group Relative Policy Optimization (GRPO). The resulting controller maps traffic states to interpretable reasoning traces and signal decisions while learning from dense per‑state supervision rather than raw cumulative environment rewards. Experiments on TSC benchmarks covering Jinan and Hangzhou show that DGLight is the strongest overall method among the compared LLM‑based controllers, remains competitive with strong RL baselines, and transfers well to city datasets not used to fit the critic. Qualitative examples further show that the model's generated reasoning is interpretable and aligned with the chosen signal phase. The project code is available \hrefhttps://github.com/yyccbb/FYP_LLMTSChere.

Authors:Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi
Title: VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
Abstract:
Vision‑language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution‑free framework that converts a judge's point score into a calibrated prediction interval using only score‑token log‑probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM‑as‑a‑Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task‑dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking‑scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi‑annotator captioning benchmark. Code: https://github.com/divake/VLM‑Judge‑Uncertainty

Authors:Alexander Kolpakov, Igor Rivin
Title: DiRe-RAPIDS: Topology-faithful dimensionality reduction at scale
Abstract:
Dimensionality reduction methods such as UMAP and t‑SNE are central tools for visualising high‑dimensional data, but their local‑neighborhood objectives can preserve sampling noise while distorting global topology. We show that standard local metrics reward this noise memorisation: top‑performing embeddings invent cycles and disconnected islands absent from the data. We introduce a topology‑faithfulness benchmark based on noisy manifolds with known homology, tune DiRe against it, and find Pareto‑optimal configurations that match or beat GPU‑accelerated UMAP on classification while recovering exact first Betti numbers on stress tests. On 723K arXiv paper embeddings, DiRe preserves 3‑4 times more topological structure than UMAP at comparable wall‑clock.

Authors:Douglas Brinkerhoff, Elizabeth Fischer
Title: Conditional Flow Matching for Probabilistic Downscaling of Maximum 3-day Snowfall in Alaska
Abstract:
Precipitation in complex terrain is governed by orographic processes operating at scales of a few kilometers, yet climate models typically run at resolutions of 50‑‑100~km where this topographic detail is absent. Dynamical downscaling with high‑resolution regional models such as WRF can resolve these processes, but the computational cost ‑‑ months of wall‑clock time per scenario ‑‑ precludes the large ensembles needed for uncertainty quantification. We present WxFlow, a conditional generative model based on flow matching that learns to map coarse‑resolution climate model output and high‑resolution topography to calibrated probabilistic ensembles of fine‑scale precipitation fields. Applied to 4~km WRF simulations of maximum 3‑day snowfall over southeast Alaska, WxFlow achieves 87.8% improvement in spectral fidelity and dramatically lower Continuous Ranked Probability Scores relative to conventional lapse‑rate‑corrected bicubic downscaling, while generating 50‑member ensembles in seconds on a laptop. Ensemble spread is spatially coherent and governed by topography, reflecting physically plausible uncertainty structure. All code is available at https://github.com/glide‑ism/wrf‑flow.

Authors:Lei Wang
Title: Quantum Dynamics via Score Matching on Bohmian Trajectories
Abstract:
We solve the time‑dependent Schrödinger equation by learning the score function, the gradient of the log‑probability density, on Bohmian trajectories. In Bohm's formulation of quantum mechanics, particles follow deterministic paths under the classical potential supplemented by a quantum potential depending on the score function of the evolving density. These non‑crossing Bohmian trajectories form a continuous normalizing flow governed by the score. We parametrize the score with a neural network and minimize a self‑consistent Fisher divergence between the network and the score of the resulting density. We prove that the zero‑loss minimizer of this self‑consistent objective recovers Schrödinger dynamics for nodeless wave functions, a condition naturally met in quantum vibrations of atoms. We demonstrate the approach on wavepacket splitting in a double‑well potential and anharmonic vibrations of a Morse chain. By recasting real‑time quantum dynamics as a self‑consistent score‑driven normalizing flow, this framework opens the time‑dependent Schrödinger equation to the rapidly advancing toolkit of modern generative modeling.

Authors:Christian Lysenstøen
Title: Feasible-First Exploration for Constrained ML Deployment Optimization in Crash-Prone Hierarchical Search Spaces
Abstract:
Deploying machine learning models under production constraints requires joint optimization over model family, quantization scheme, runtime backend, and serving configuration. This induces a hierarchical mixed‑variable search space in which many configurations are invalid: evaluations may crash, exceed memory limits, or violate latency constraints. Standard black‑box optimizers such as Tree‑structured Parzen Estimators (TPE) and constrained Bayesian optimization are effective when valid configurations are common, but they can spend a large fraction of a small evaluation budget on invalid or uninformative trials in hostile deployment spaces. This paper studies that regime and asks whether optimization should be decomposed into an explicit exploration stage followed by model‑guided exploitation. We propose Thermal Budget Annealing (TBA), a feasible‑first exploration procedure that maps valid and feasible regions before warm‑starting TPE. The method includes two robustness mechanisms for hostile hardware: trial timeouts that abort clearly infeasible evaluations early, and subspace blacklisting that temporarily suppresses categorical subspaces after repeated failures. We also introduce DeployBench, a benchmark suite for deployment optimization with hierarchical structure, hidden crash zones, hard constraints, and unequal evaluation costs. On synthetic benchmarks and real GPU deployment with five pre‑trained vision models across five GPU targets (NVIDIA H100, A100, RTX 5080, L4, and T4), the proposed hybrid improves model‑family discovery under tight constraints while reducing wasted budget relative to cold‑start TPE.

Authors:Dhruv Gupta
Title: Null Measurability at the Symmetrization Interface in VC Learning
Abstract:
Recent work revisiting measurability in the fundamental theorem of statistical learning imposes Borel measurability of ghost‑gap suprema. We show that, at the one‑sided ghost‑gap interface actually used by the standard symmetrization proof, this requirement is stronger than necessary. For any Borel‑parameterized concept class on a Polish domain, the bad event "there exists a hypothesis whose ghost empirical error exceeds its training empirical error by at least ε/2" is analytic. By Choquet capacitability, it is therefore measurable in the completion of every finite Borel measure. We then construct a concept class whose bad event is null‑measurable but not Borel, giving a strict separation from the Borel supremum condition. Finally, we prove closure under patching, fixed and countable interpolation, and fiber‑product amalgamation, showing that the weaker regularity level is stable under natural concept‑class constructors. In the realizable setting, where targets belong to the class and are measurable, these results weaken the measurability hypothesis needed by the symmetrization route from finite VC dimension to PAC learnability. The main results and the descriptive‑set‑theoretic infrastructure used by them are formalized in Lean 4.

Authors:Ishan Patel, Ishan Joshi
Title: PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
Abstract:
We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent ‑‑ the standard paradigm ‑‑ PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE ‑‑ a Fast Walsh‑Hadamard Transform (FWHT) rotation followed by 3‑bit Lloyd‑Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2‑1.7B‑Instruct and Llama‑3‑8B‑Instruct), three context lengths (600‑7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama‑3‑8B with 15 agents sharing a 4K‑token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB ‑‑ a 97.7% reduction ‑‑ while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to ‑0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy‑compressed KV pool with multi‑reader concurrent agent access.

Authors:Jing Chen, Abhijay Deevi, Onat Gungor, Tajana Rosing
Title: CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic
Abstract:
The Controller Area Network (CAN) is a safety‑critical in‑vehicle communication protocol that lacks built‑in security mechanisms, making intrusion detection essential. Existing approaches predominantly formulate CAN intrusion detection as a classification task, mapping complex traffic patterns to attack labels. However, this formulation abstracts away the temporal and relational structure of CAN traffic and misaligns with real‑world forensic workflows, which require systematic reasoning about traffic behavior. To address this gap, we introduce CAN‑QA, the first benchmark that reformulates CAN traffic analysis as a question‑answering (QA) task. CAN‑QA converts raw CAN logs into temporally segmented windows and applies deterministic rule‑based templates to generate natural‑language questions paired with automatically derived ground‑truth answers. The resulting dataset comprises 33,128 QA pairs across 10 categories, each targeting distinct semantic and temporal properties of CAN traffic. Using CAN‑QA, we evaluate large language models across both True/False and multiple‑choice formats. Our results indicate that, although these models capture superficial statistical regularities, they struggle with temporal reasoning, multi‑condition inference, and higher‑level behavioral interpretation. Our code is available at https://github.com/Kriiiiss/CAN‑QA.

Authors:Yuanhao Zeng, Ao Lu, Lufei Li, Zheng Zhang, Yexin Li, Kan Ren
Title: Large Language Models Explore by Latent Distilling
Abstract:
Generating diverse responses is crucial for test‑time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface‑level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well‑known observation that neural networks tend to make lower‑error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep‑layer hidden representations of the LLM from its shallow‑layer representations to model the LLM's depth‑wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less‑explored semantic patterns. ESamp is implemented with an asynchronous training‑‑inference pipeline, with less than 5% worst case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines. Notably, ESamp achieves robust generalization across mathematics, science, and code generation benchmarks and breaks the trade‑off between diversity and coherence in creative writing. Our code has released at: https://github.com/LinesHogan/tLLM.

Authors:Antoine P. Leeman, Shuyu Zhan, Melanie N. Zeilinger, Glen Chou
Title: VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis
Abstract:
We propose VISION‑SLS, a method for nonlinear output‑feedback control from high‑resolution RGB images which provides robust constraint satisfaction guarantees under calibrated uncertainty bounds despite partial observability, sensor noise, and nonlinear dynamics. To enable scalability while retaining guarantees, we propose: (i) a learned low‑dimensional observation map from pretrained visual features with state‑dependent error bounds, and (ii) a causal affine time‑varying output‑feedback policy optimized via System Level Synthesis (SLS). We develop a scalable, novel solver for the resulting nonconvex program that leverages sequential convex programming coupled with efficient Riccati recursions. On two simulated visuomotor tasks (a 4D car and a 10D quadrotor) with >= 512 x 512 pixels and a 59D humanoid task with partial observability, our method enables safe, information‑gathering behavior that reduces uncertainty while guaranteeing constraint satisfaction with empirically‑calibrated error bounds. We also validate our method on hardware, safely controlling a ground vehicle from onboard images, outperforming baselines in safety rate and solve times. Together, these results show that learned visual abstractions coupled with an efficient solver make SLS‑based safe visuomotor output‑feedback practical at scale. The code implementation of our method is available at https://github.com/trustworthyrobotics/VISION‑SLS.

Authors:Maitreya Patel, Jingtao Li, Weiming Zhuang, Yezhou Yang, Lingjuan Lv
Title: VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
Abstract:
We introduce an efficient, resolution‑agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution‑agnostic 1D Transformer‑based image tokenizer that encodes images into a dynamic, user‑controllable sequence of 32‑256 tokens, achieving a state‑of‑the‑art efficiency and performance trade‑off. Building on VibeToken, we present VibeToken‑Gen, a class‑conditioned AR generator with out‑of‑the‑box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken‑Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion‑based state‑of‑the‑art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed‑resolution AR models such as LlamaGen ‑‑ whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) ‑‑ VibeToken‑Gen maintains a constant 179G FLOPs (63.4x efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.

Authors:Nishit Anand, Manan Suri, Christopher Metzler, Dinesh Manocha, Ramani Duraiswami
Title: Learning Illumination Control in Diffusion Models
Abstract:
Controlling illumination in images is essential for photography and visual content creation. While closed‑source models have demonstrated impressive illumination control, open‑source alternatives either require heavy control inputs like depth maps or do not release their data and code. We present a fully open‑source and reproducible pipeline for learning illumination control in diffusion models. Our approach builds a data engine that transforms well‑lit images into supervised training triplets consisting of a poorly‑illuminated input image, a natural language lighting instruction, and a well‑illuminated output image. We finetune a diffusion model on this data and demonstrate significant improvements over baseline SD 1.5, SDXL, and FLUX.1‑dev models in perceptual similarity, structural similarity, and identity preservation. Our work provides a reproducible solution built entirely with open‑source tools and publicly available data. We release all our code, data, and model weights publicly.

Authors:Tingwu Wang, Olivier Dionne, Michael De Ruyter, David Minor, Davis Rempe, Kaifeng Zhao, Mathis Petrovich, Ye Yuan, Chenran Li, Zhengyi Luo, Brian Robison, Xavier Blackwell, Bernardo Antoniazzi, Xue Bin Peng, Yuke Zhu, Simon Yuen
Title: MotionBricks: Scalable Real-Time Motions with Modular Latent Generative Model and Smart Primitives
Abstract:
Despite transformative advances in generative motion synthesis, real‑time interactive motion control remains dominated by traditional techniques. In this work, we identify two key challenges in bridging research and production: 1) Real‑time scalability: Industry applications demand real‑time generation of a vast repertoire of motion skills, while generative methods exhibit significant degradation in quality and scalability under real‑time computation constraints, and 2) Integration: Industry applications demand fine‑grained multi‑modal control involving velocity commands, style selection, and precise keyframes, a need largely unmet by existing text‑ or tag‑driven models. To overcome these limitations, we introduce MotionBricks: a large‑scale, real‑time generative framework with a two‑fold solution. First, we propose a large‑scale modular latent generative backbone tailored for robust real‑time motion generation, effectively modeling a dataset of over 350,000 motion clips with a single model. Second, we introduce smart primitives that provide a unified, robust, and intuitive interface for authoring both navigation and object interaction. Applications can be designed in a plug‑and‑play manner like assembling bricks without expert animation knowledge. Quantitatively, we show that MotionBricks produces state‑of‑the‑art motion quality on open‑source and proprietary datasets of various scales, while also achieving a real‑time throughput of 15,000 FPS with 2ms latency. We demonstrate the flexibility and robustness of MotionBricks in a complete production‑level animation demo, covering navigation and object‑scene interaction across various styles with a unified model. To showcase our framework's application beyond animation, we deploy MotionBricks on the Unitree G1 humanoid robot to demonstrate its flexibility and generalization for real‑time robotic control.

Authors:Peng Liao, Peijia Zheng, Lingbo Li, Shangsong Liang, Lin Chen
Title: Intrinsic Mutual Information as a Modulator for Preference Optimization
Abstract:
Offline preference optimization methods, such as Direct Preference Optimization (DPO), offer significant advantages in aligning Large Language Models (LLMs) with human values. However, achieving optimal performance with these methods typically involves additional hyperparameter tuning, resulting in substantial time overhead. Although prior work has proposed a range of improvements, these methods remain limited in effectiveness and have not fully eliminated reliance on hyperparameter tuning. In this work, we propose RMiPO, a lightweight and efficient framework for offline preference optimization. RMiPO leverages intrinsic Response‑level Mutual information for Preference Optimization with hyperparameter modulation, dynamically decoupling preference contributions at negligible additional computational cost. Extensive experimental results demonstrate that RMiPO achieves consistently superior performance over existing methods while reducing training overhead by more than 15%. Our code is available at https://github.com/liavonpenn/rmipo.

Authors:Thomas Carmichael
Title: Architecture Determines Observability in Transformers
Abstract:
Autoregressive transformers make confident errors, but activation monitoring can catch them only if the model preserves an internal signal that output confidence does not expose. This preservation is determined by architecture and training recipe. We define observability as the linear readability of per‑token decision quality from frozen mid‑layer activations after controlling for max‑softmax confidence and activation norm. The correction is essential. Confidence controls absorb 57.7% of raw probe signal on average across 13 models in 6 families. Observability is not a generic property of transformers. In Pythia's controlled suite, every tested run with the 24‑layer, 16‑head configuration collapses to rho_partial ~0.10 across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a separated healthy band from 0.21 to 0.38. The output‑controlled residual collapses at the same points, and neither tested nonlinear probes nor layer sweeps recover healthy‑range signal. Checkpoint dynamics show the collapse is emergent during training. Both configurations at matched hidden dimension form the signal at the earliest measured checkpoint, but training erases it in the (24L, 16H) class while predictive loss continues improving. Across independent recipes the collapse map changes but the phenomenon persists. Qwen 2.5 and Llama differ by 2.9x at matched 3B scale with probe seed distributions that do not overlap, while Mistral 7B preserves observability where Llama 3.1 8B collapses despite similar broad architecture. A WikiText‑trained observer transfers to downstream QA without training on those tasks, catching errors confidence misses. At 20% flag rate, its exclusive catch rate is 10.9‑13.4% of all errors in seven of nine model‑task cells. Architecture selection is a monitoring decision.

Authors:Wenjie Du, Yiyuan Yang, Tianxiang Zhan, Qingsong Wen
Title: End-to-End Learning for Partially-Observed Time Series with PyPOTS
Abstract:
Partially‑observed time series (POTS) is ubiquitous in real‑world applications, yet most existing toolchains separate missing‑value handling from downstream learning, which limits reproducibility and overall performance. This tutorial introduces PyPOTS, an open‑source Python ecosystem for end‑to‑end data mining and machine learning on POTS. We present practical workflows spanning missingness simulation, data preprocessing, model training, and evaluation across core tasks, including imputation, forecasting, classification, clustering, and anomaly detection. The tutorial consists of two parts: Part I emphasizes hands‑on application for practitioners through unified APIs and benchmark‑oriented experiments. Part II targets developers and researchers, focusing on extending PyPOTS with custom models, domain‑specific constraints, and contribution‑ready engineering practices. Participants will gain both conceptual understanding and implementation experience for building robust, transparent, and reusable POTS pipelines in research and production settings. PyPOTS is publicly available at https://github.com/WenjieDu/PyPOTS

Authors:Hojoon Kim, Yuheng Wu, Thierry Tambe
Title: AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents
Abstract:
Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per‑step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per‑step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi‑agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache‑based plan reuse thus offers a practical path to low‑latency, low‑cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.

Authors:Yutong He, Zhengyang Huang, Jiahe Geng
Title: FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection
Abstract:
Federated learning enables a population of clients to collaboratively train machine learning models without exchanging their raw data, but standard algorithms such as FedAvg suffer from slow convergence and high communication and memory costs in heterogeneous, resource‑constrained environments. We introduce FedSLoP, a federated optimization algorithm that combines stochastic low‑rank subspace projections of gradients, thereby reducing the dimension of communicated and stored updates while preserving optimization progress. On the theoretical side, we develop a detailed nonconvex convergence analysis under standard smoothness and bounded‑variance assumptions, showing that FedSLoP is guaranteed to converge to a first‑order stationary point at a rate of O(1/\sqrtNT). On the empirical side, we conduct extensive experiments on federated MNIST classification with heterogeneous data partitions, showing that FedSLoP substantially reduces communication volume and client‑side memory while achieving competitive or better accuracy compared with FedAvg and representative sparse or low‑rank baselines. Together, our results demonstrate that random subspace momentum methods such as FedSLoP provide a principled and effective approach to communication‑ and memory‑efficient federated learning. Codes are available at: https://github.com/pkumelon/FedSLoP.git.

Authors:Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng
Title: TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
Abstract:
On‑policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain‑specific models to smaller students. While effective on static single‑turn tasks, its behavior in multi‑turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory‑Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter‑turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On‑Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student‑teacher pairs on three multi‑turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails. Our code is available at https://github.com/kokolerk/TCOD.

Authors:Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu
Title: Stabilizing Efficient Reasoning with Step-Level Advantage Selection
Abstract:
Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length‑based rewards or pruning, many approaches are post‑trained under a much shorter context window than base‑model training, a factor whose effect has not been systematically isolated. We first show that short‑context post‑training alone, using standard GRPO without any length‑aware objective, already induces substantial reasoning compression‑but at the cost of increasingly unstable training dynamics and accuracy degradation. To address this, we propose Step‑level Advantage Selection (SAS), which operates at the reasoning‑step level and assigns a zero advantage to low‑confidence steps in correct rollouts and to high‑confidence steps in verifier‑failed rollouts, where failures often arise from truncation or verifier issues rather than incorrect reasoning. Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length‑aware baseline while reducing average reasoning length by 16.3%, yielding a better accuracy‑efficiency trade‑off.

Authors:Nicola Zanarini, Niccolò Ferrari
Title: Graph Memory Transformer (GMT)
Abstract:
We investigate whether the Feed‑Forward Network (FFN) sublayer in a decoder‑only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self‑attention intact, but replaces the usual per‑token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 128 edge matrix, gravitational source routing, token‑conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder‑only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M‑parameter dense GPT‑style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source‑to‑target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero‑shot benchmark behavior under the evaluated setting. These results are not intended as a state‑of‑the‑art claim; they support the viability and structural interpretability of replacing dense within‑token transformation with graph‑mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.

Authors:Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee
Title: ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
Abstract:
Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present ELSA, an algorithmic reformulation of online softmax attention that (i)~preserves exact softmax semantics in real arithmetic with a \emphprovable \mathcalO(u\log n) FP32 relative error bound; (ii)~casts the online softmax update as a prefix scan over an associative monoid (m,S,W), yielding O(n) extra memory and O(\log n) parallel depth; and (iii)~is Tensor‑Core independent, implemented in Triton and CUDA C++, and deployable as a \emphdrop‑in replacement requiring no retraining or weight modification. Unlike FlashAttention‑2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource‑constrained edge devices such as Jetson TX2 ‑‑ making it the only hardware‑agnostic exact‑attention kernel that reduces parallel depth to O(\log n) at full precision. On A100 FP32 benchmarks (1K‑‑16K tokens), ELSA delivers 1.3‑‑3.5× speedup over memory‑efficient SDPA and 1.97‑‑2.27× on BERT; on Jetson TX2, ELSA achieves 1.5‑‑1.6× over Math (64‑‑900 tokens), with 17.8‑‑20.2% throughput gains under LLaMA‑13B offloading at \ge32K. In FP16, ELSA approaches hardware‑fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high‑precision inference across platforms. Our code and implementation are available at https://github.com/ming053l/ELSA.

Authors:Seongjin Choi
Title: Beyond coauthorship: semantic structure and phantom collaborators in transportation research, 1967--2025
Abstract:
We present a semantic‑structural atlas of transportation research built from 120,323 papers across 34 peer‑reviewed journals published between 1967 and 2025, roughly an order of magnitude larger than and a decade beyond Sun and Rahwan's~(2017) coauthorship study. We use OpenAlex and Crossref as open, CC0‑licensed data sources, resolve author identity through OpenAlex author IDs, ORCID records, and manual alias resolution, and embed every paper with SPECTER2 with Arora‑style whitening concatenated with concept TF‑‑IDF and venue linear‑discriminant projections. On this substrate we report three findings. First, Leiden on the author‑level semantic k‑nearest‑neighbor graph yields 23 topic communities that agree only weakly with the 172 coauthor communities (normalized mutual information 0.23), opening room for a predictive layer that neither source encodes alone. Second, a multiplex Leiden partition combining both edge types recovers 181 communities and localizes where collaboration and topic structure decouple. Third ‑‑ the paper's core methodological contribution ‑‑ we define \emphphantom collaborators, pairs of authors who are top‑K semantic neighbors yet \geq 3 hops apart in the coauthor graph, and show via a temporal hold‑out (training cutoff 2019) that phantom pairs become real coauthors in 2020‑‑2025 at a rate 16 to 33 times above random, popularity‑weighted, and same‑venue baselines, with a 68‑fold monotone gradient between the highest‑ and lowest‑similarity buckets. All artifacts are released as a live, reproducible web atlas at https://choi‑seongjin.github.io/transport‑atlas/.

Authors:Yuanming Shi, Andreas Haupt
Title: The Collapse of Heterogeneity in Silicon Philosophers
Abstract:
Silicon samples are increasingly used as a low‑cost substitute for human panels and have been shown to reproduce aggregate human opinion with high fidelity. We show that, in the alignment‑relevant domain of philosophy, silicon samples systematically collapse heterogeneity. Using data from N = 277 professional philosophers drawn from PhilPeople profiles, we evaluate seven proprietary and open‑source large language models on their ability to replicate individual philosophical positions and to preserve cross‑question correlation structures across philosophical domains. We find that language models substantially over‑correlate philosophical judgments, producing artificial consensus across domains. This collapse is associated in part with specialist effects, whereby models implicitly assume that domain specialists hold highly similar philosophical views. We assess the robustness of these findings by studying the impact of DPO fine‑tuning and by validating results against the full PhilPapers 2020 Survey (N = 1785). We conclude by discussing implications for alignment, evaluation, and the use of silicon samples as substitutes for human judgment. The code of this project can be found at https://github.com/stanford‑del/silicon‑philosophers.

Authors:Sifan Wang, Shawn Koohy, Yiping Lu, Paris Perdikaris
Title: When PINNs Go Wrong: Pseudo-Time Stepping Against Spurious Solutions
Abstract:
Physics‑informed neural networks (PINNs) provide a promising machine learning framework for solving partial differential equations, but their training often breaks down on challenging problems, sometimes converging to physically incorrect solutions despite achieving small residual losses. This failure, we argue, is not merely an optimization difficulty. Rather, it reflects a fundamental weakness of the empirical PDE residual loss, which can admit trivial or spurious solutions during training. From this perspective, we revisit pseudo‑time stepping, a technique that has recently shown strong empirical success in PINNs. We show that its main benefit is not simply to ease optimization; instead, when combined with collocation‑point resampling, it helps reveal and avoid spurious solutions. At the same time, we find that the effectiveness of pseudo‑time stepping depends critically on the choice of step size, which cannot be tuned reliably from the training loss alone. To overcome this limitation, we propose an adaptive pseudo‑time stepping strategy that selects the step size from a finite‑difference surrogate of the local residual Jacobian, yielding the largest step permitted by local stability without per‑problem tuning. Across a diverse set of PDE benchmarks, the proposed method consistently improves both accuracy and robustness. Together, these findings provide a clearer understanding of why PINNs fail and suggest a practical pathway toward more reliable physics‑informed learning. All code and data accompanying this manuscript are available at https://github.com/sifanexisted/jaxpi2.

Authors:Lichen Li, Hengguang Zhou, Yijun Liang, Tianyi Zhou, Cho-Jui Hsieh
Title: Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
Abstract:
Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in‑the‑wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in‑the‑wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO) by injecting conflicting unit tests as tracers and applying a "resampling‑until‑hack" mechanism. Through controlled comparisons between monitors trained on synthetic versus in‑the‑wild data, we find that (1) synthetic‑data‑trained monitors fail to generalize to "in‑the‑wild" hacking, and (2) monitors trained on our "in‑the‑wild" trajectories demonstrate stronger generalizability to unseen hacking types. Our results indicate that synthetic reward hacking data may not fully reflect natural reward hacking behaviors, and that relying solely on synthetic data can lead to misleading conclusions. The codebase is available at https://github.com/LichenLillc/CoTMonitoring.git

Authors:Jainum Sanghavi
Title: From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers
Abstract:
Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT‑B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per‑patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5‑6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random‑weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%. Targeted activation patching along that direction shows the depth signal is partially re‑derived at each layer rather than passively carried in the residual stream, with mid‑layer interventions persisting most strongly downstream. The result is that a classification‑trained ViT develops an actively maintained spatial hierarchy that mirrors the early‑to‑late progression observed in the primate visual cortex.

Authors:Lucky Verma
Title: When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
Abstract:
Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime‑dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT‑2‑family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross‑checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500‑step saturation heuristic classifies DyT's sign with 75% raw in‑sample accuracy on the 12‑cell GPT‑2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave‑one‑scale‑out accuracy. Three interventions support the bounding explanation: HardTanh reproduces the regime pattern, increasing alpha at 118M monotonically reduces DyT's penalty, and vanilla+dropout(p=0.5) matches DyT's data‑rich loss. We also localize Llama‑DyT collapse to SwiGLU gating, where saturation separates collapse from convergence in a 3‑seed component ablation (r=0.94). Scope: all experiments are compute‑limited (T/P < 1.84), below Chinchilla‑optimal training.

Authors:Emre Ardıç, Yakup Genç
Title: Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning with Adaptive Quantization and Differential Privacy
Abstract:
Federated learning (FL) is a distributed machine learning method where multiple devices collaboratively train a model under the management of a central server without sharing underlying data. One of the key challenges of FL is the communication bottleneck caused by variations in connection speed and bandwidth across devices. Therefore, it is essential to reduce the size of transmitted data during training. Additionally, there is a potential risk of exposing sensitive information through the model or gradient analysis during training. To address both privacy and communication efficiency, we combine differential privacy (DP) and adaptive quantization methods. We use Laplacian‑based DP to preserve privacy, which is relatively underexplored in FL and offers tighter privacy guarantees than Gaussian‑based DP. We propose a simple and efficient global bit‑length scheduler using round‑based cosine annealing, along with a client‑based scheduler that dynamically adapts based on client contribution estimated through dataset entropy analysis. We evaluate our approach through extensive experiments on CIFAR10, MNIST, and medical imaging datasets, using non‑IID data distributions across varying client counts, bit‑length schedulers, and privacy budgets. The results show that our adaptive quantization methods reduce total communicated data by up to 52.64% for MNIST, 45.06% for CIFAR10, and 31% to 37% for medical imaging datasets compared to 32‑bit float training while maintaining competitive model accuracy and ensuring robust privacy through differential privacy.

Authors:Hanna Rød, Dagny Streit, Nils Valseth Selte, Justin Li
Title: When Context Sticks: Studying Interference in In-Context Learning
Abstract:
This paper investigates context stickiness in in‑context learning (ICL), a phenomenon where earlier examples in a prompt interfere with a transformer's ability to adapt to later tasks. Using synthetic regression tasks over linear and quadratic functions, we examine how models trained under sequential, mixed, and random curricula handle abrupt task switches during inference. By sweeping over structured combinations of misleading linear examples followed by recovery quadratic examples, we quantify how prior context biases prediction error and how quickly models realign. Our results show strong evidence of persistent interference: more preceding linear examples reliably degrade quadratic predictions, while additional quadratic examples reduce error but with diminishing returns. We further find that training curricula significantly modulate resilience, with sequential training on the target function class yielding the fastest recovery, and surprisingly, random training producing the least robust behavior.

Authors:Jelena Ilić Vulićević
Title: An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code
Abstract:
Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud‑based models or specialized hardware, limiting practical applicability in privacy‑sensitive or resource‑constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real‑world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero‑shot prompting approach at the function level and an automated keyword‑based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the importance of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context dependent bugs in realistic development scenarios.

Authors:Thibaud Southiratn, Bonil Koo, Yijingxiu Lu, Sun Kim
Title: CombiMOTS: Combinatorial Multi-Objective Tree Search for Dual-Target Molecule Generation
Abstract:
Dual‑target molecule generation, which focuses on discovering compounds capable of interacting with two target proteins, has garnered significant attention due to its potential for improving therapeutic efficiency, safety and resistance mitigation. Existing approaches face two critical challenges. First, by simplifying the complex dual‑target optimization problem to scalarized combinations of individual objectives, they fail to capture important trade‑offs between target engagement and molecular properties. Second, they typically do not integrate synthetic planning into the generative process. This highlights a need for more appropriate objective function design and synthesis‑aware methodologies tailored to the dual‑target molecule generation task. In this work, we propose CombiMOTS, a Pareto Monte Carlo Tree Search (PMCTS) framework that generates dual‑target molecules. CombiMOTS is designed to explore a synthesizable fragment space while employing vectorized optimization constraints to encapsulate target affinity and physicochemical properties. Extensive experiments on real‑world databases demonstrate that CombiMOTS produces novel dual‑target molecules with high docking scores, enhanced diversity, and balanced pharmacological characteristics, showcasing its potential as a powerful tool for dual‑target drug discovery. The code and data is accessible through https://github.com/Tibogoss/CombiMOTS.

Authors:Varun Totakura, Ankita Singh, Yushun Dong, Shayok Chakraborty
Title: An Analysis of Active Learning Algorithms using Real-World Crowd-sourced Text Annotations
Abstract:
Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data and tremendously reduce human annotation effort in inducing a machine learning model. In a conventional active learning setup, the labeling oracles are assumed to be infallible, that is, they always provide correct answers (in terms of class labels) to the queried unlabeled instances, which cannot be guaranteed in real‑world applications. To this end, a body of research has focused on the development of active learning algorithms in the presence of imperfect / noisy oracles. Existing research on active learning with noisy oracles typically simulate the oracles using machine learning models; however, real‑world situations are much more challenging, and using ML models to simulate the annotation patterns may not appropriately capture the nuances of real‑world annotation challenges. In this research, we first collect annotations of text samples (from 3 benchmark text classification datasets) from crowd‑sourced workers through a crowd‑sourcing platform. We then conduct extensive empirical studies of 8 commonly used active learning techniques (in conjunction with deep neural networks) using the obtained annotations. Our analyses sheds light on the performance of these techniques under real‑world challenges, where annotators can provide incorrect labels, and can also refuse to provide labels. We hope this research will provide valuable insights that will be useful for the deployment of deep active learning systems in real‑world applications. The obtained annotations can be accessed at https://github.com/varuntotakura/al_rcta/.

Authors:Bishwamittra Ghosh, Soumi Das, Till Speicher, Qinyuan Wu, Mohammad Aflah Khan, Deepak Garg, Krishna P. Gummadi, Evimaria Terzi
Title: Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
Abstract:
Large language models (LLMs) operate in two fundamental learning modes ‑ fine‑tuning (FT) and in‑context learning (ICL) ‑ raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task ‑ offering precise language boundaries, controlled string sampling, and no data contamination ‑ and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in‑language strings than to out‑of‑language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in‑distribution generalization, but both perform equally well on out‑of‑distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.

Authors:Xinyue Zhang, Yuanhao Ding, Xiang Ao
Title: Follow the TRACE: Exploiting Post-Click Trajectories for Online Delayed Conversion Rate Prediction
Abstract:
Delayed feedback poses a core challenge for online CVR prediction, forcing a trade‑off between label accuracy and data freshness. Existing methods address this through delay modeling or sample reweighting, yet neglect how post‑click behaviors evolve over the observation period. To overcome this limitation, we formalize this evolution as feedback trajectory and propose TRACE. Instead of forcing hard labels on unrevealed samples, our method evaluates how well the accumulated feedback status aligns with conversion versus non‑conversion, dynamically refining posteriors without waiting for final outcomes. To counteract early‑stage trajectory sparsity, we further design a reliability‑gated retrospective completer that leverages full‑lifecycle data to provide adaptive posterior guidance for unrevealed samples. Extensive experiments validate TRACE's superiority over state‑of‑the‑art baselines and confirm the retrospective completion module as a model‑agnostic enhancer for existing systems. Our code is available at https://github.com/LunaZhangxy/TRACE.

Authors:Wugeng Zheng, Ziwen Kan, Katie Wang, Chen Chen, Song Wang
Title: Conditional Imputation for Within-Modality Missingness in Multi-Modal Federated Learning
Abstract:
Multimodal Federated Learning (MMFL) enables privacy‑preserving collaborative training, but real‑world clinical applications often suffer from within‑modality missingness caused by sensor intermittency or irregular sampling. Existing methods implicitly represent unobserved data via architectural alignment or missing embeddings, often failing to recover the true distribution and yielding sub‑optimal performance. We propose CondI, a federated framework explicitly addressing this missingness using conditional diffusion models. CondI employs a two‑phase training pipeline: first, imputing unobserved temporal components using available multimodal context and conditional embeddings; second, optimizing modality‑specific extractors and joint embedding spaces. During inference, imputed raw data pass through trained extractors to generate robust features, providing a holistic representation for downstream tasks. Explicit data imputation ensures models operate on complete semantic structures, significantly enhancing resilience against severe data incompleteness. Experiments on three clinical datasets (PTB‑XL, SLEEP‑EDF, MIMIC‑IV) demonstrate CondI achieves comparable results to state‑of‑the‑art baselines. Code: https://github.com/ZhengWugeng/CondI

Authors:Zhicheng Ma, Xiang Liu, Zhaoxiang Liu, Ning Wang, Yi Shen, Kai Wang, Shuming Shi, Shiguo Lian
Title: Mixture of Heterogeneous Grouped Experts for Language Modeling
Abstract:
Large Language Models (LLMs) based on Mixture‑of‑Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes,creating a rigidity that fails to align computational costs with varying token‑level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system‑level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two‑level routing mechanism to enable flexible, resource‑aware expert combinations. To optimize inference efficiency, we propose a Group‑Wise Auxiliary Loss, which dynamically steers tokens to the most parameter‑efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All‑size Group‑decoupling Allocation strategy coupled with an Intra‑Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource‑efficient MoE design, offering a practical solution for optimizing inference costs in real‑world scenarios. The code is publicly available at https://github.com/UnicomAI/MoHGE.

Authors:Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang
Title: ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Abstract:
Evaluating generative AI models is increasingly resource‑intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre‑trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty‑aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre‑trained GP‑based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8‑65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.

Authors:Rui Gao, Youngseung Jeon, Swastik Roy, Morteza Ziyadi, Xiang 'Anthony' Chen
Title: C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs
Abstract:
Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug‑design constraints remains challenging. We propose C‑Moral, a reinforcement learning post‑training framework for controllable multi‑objective molecular optimization. C‑Moral combines group‑based relative optimization, property score alignment for heterogeneous objectives, and continuous non‑linear reward aggregation to improve stability across competing properties. Experiments on the C‑MuMOInstruct benchmark show that C‑Moral consistently outperforms state‑of‑the‑art models across both in‑domain and out‑of‑domain settings, achieving the best Success Optimized Rate (SOR) of 48.9% on IND tasks and 39.5% on OOD tasks, while largely preserving scaffold similarity. These results suggest that RL post‑training is an effective way to align molecular language models with continuous molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C‑MORAL.

Authors:Zixuan Xia, Quanxi Li
Title: K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
Abstract:
We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recursively estimates the latent reward mean, smoothing high‑variance returns and adapting to non‑stationary environments. This approach incurs minimal overhead and requires no modification to existing policy architectures. Experiments on LunarLander and CartPole demonstrate that Kalman‑filtered rewards significantly accelerate convergence and reduce training variance compared to standard normalization techniques. Code is available at https://github.com/Sumxiaa/Kalman_Normalization.

Authors:Jeremy Ellis
Title: On-Device Vision Training, Deployment, and Inference on a Thumb-Sized Microcontroller
Abstract:
This paper presents a complete, end‑to‑end on‑device vision machine learning pipeline, comprising data acquisition, two‑layer CNN training with Adam optimization, and real‑time inference, executing entirely on a microcontroller‑class device costing 15‑40 USD. Unlike cloud‑based workflows that require external infrastructure and conceal the computational pipeline from the practitioner, this system implements every step of the core ML lifecycle in approximately 1,750 lines of readable C++ that compiles in under one minute using the Arduino IDE, with no external ML dependencies. Running on the Seeed Studio ESP32‑S3 XIAO ML Kit (8 MB PSRAM), the firmware achieves three‑class 64x64 image classification in approximately 9 minutes per training run, with real‑time inference at 6.3 FPS. Key contributions include: correct batch‑level gradient accumulation; pre‑computed resize lookup tables for inference; dual‑format weight export for SD‑free baked‑in deployment; a three‑tier weight priority system (SD binary > baked‑in header > He‑initialization) resolved automatically at boot; a single‑constant network reconfiguration interface; and PSRAM‑aware memory management suited to microcontroller constraints. All source code and reference datasets are released under the MIT License at https://github.com/webmcu‑ai/on‑device‑vision‑ai

Authors:Rahul Patel
Title: AnemiaVision: Non-Invasive Anemia Detection via Smartphone Imagery Using EfficientNet-B3 with TrivialAugmentWide, Mixup Augmentation, and Persistent Patient History Management
Abstract:
Anemia affects over one billion people globally and remains severely under‑diagnosed in low‑resource regions where laboratory blood tests are inaccessible. This paper presents AnemiaVision, an end‑to‑end web‑based system for non‑invasive anemia screening from smartphone photographs of the palpebral conjunctiva and fingernail beds. The proposed pipeline fine‑tunes a pre‑trained EfficientNet‑B3 backbone with a redesigned three‑layer classifier head incorporating BatchNorm, GELU activations, and high‑rate Dropout (0.45/0.35). Training employs four orthogonal accuracy‑boosting techniques: TrivialAugmentWide for policy‑free image augmentation, RandomErasing for spatial regularisation, Mixup (alpha=0.2) for inter‑class smoothing, and cosine‑annealing scheduling with linear warmup. Early stopping is governed by peak validation accuracy rather than validation loss to prevent premature termination on high‑variance epochs. The deployed Flask application integrates persistent patient‑history management backed by PostgreSQL on Render, with an automated database‑migration entrypoint ensuring zero data loss across redeploys. Ablation experiments demonstrate that accuracy‑first early stopping contributes +1.6% and Mixup contributes +2.8% to final validation accuracy. Overall, the proposed system achieves a validation accuracy of 96.2% and AUC‑ROC of 0.98, compared with 44.9% validation accuracy and AUC‑ROC of 0.58 from the three‑epoch CPU‑only baseline. Sensitivity for the anemic class reaches 0.96, making the system suitable as a first‑line screening tool for community health workers in rural settings. The system is publicly accessible and source code is openly available.

Authors:Nikoo Moradi, Gijs Luijten, Behrus Hinrichs-Puladi, Jens Kleesiek, Victor Alves, Jan Egger, André Ferreira
Title: VS-DDPM: Efficient Low-Cost Diffusion Model for Medical Modality Translation
Abstract:
Diffusion models produce high‑quality synthetic data but suffer from slow inference. We propose 3D Variable‑Step Denoising Diffusion Probabilistic Model (VS‑DDPM) a framework engineered to maintain generative quality while accelerating inference by several factors. We tested our approach on four tasks (missing MRI, tumor removal, MRI‑to‑sCT, and CBCT‑to‑sCT) within the BraTS2025 and SynthRAD2025 challenges. Designed for high efficiency under hardware and time constrains imposed by both challenges. VS‑DDPM achieved state‑of‑the‑art (SOTA) performance in missing MRI synthesis, yielding Dice scores of 0.80, 0.83, and 0.88 for the enhancing tumor, tumor core, and whole tumor regions, respectively, alongside a structural similarity index (SSIM) of 0.95. For MRI tumor removal, the model attained a root mean squared error (RMSE) of 0.053, a peak signal‑to‑noise ratio (PSNR) of 26.77, and an SSIM of 0.918. While the framework demonstrated competitive performance in MRI‑to‑sCT and CBCT‑to‑sCT tasks, it did not reach SOTA benchmarks, potentially due to sensitivities in data pre and post‑processing pipelines or specific loss function configurations. These results demonstrate that VS‑DDPM provides a robust and tunable solution for high‑fidelity 3D medical image synthesis. The code is available in https://github.com/andre‑fs‑ferreira/SynthRAD_by_Faking_it.

Authors:Dong Liu, Haisheng Wang, Yanxuan Yu
Title: Accelerating Frequency Domain Diffusion Models with Error-Feedback Event-Driven Caching
Abstract:
Diffusion models achieve remarkable success in time series generation. However, slow inference limits their practical deployment. We propose E^2‑CRF (Error‑Feedback Event‑Driven Cumulative Residual Feature caching) to accelerate frequency domain diffusion models. Our method exploits two structural properties: (1) spectral localization, where signal energy concentrates in low frequencies, and (2) mirror symmetry, which halves the effective frequency dimension. E^2‑CRF uses a closed‑loop error‑feedback system that adaptively caches transformer KV features across diffusion steps. We trigger recomputation using event‑driven residual dynamics instead of fixed schedules. Our method selectively recomputes high‑energy or rapidly‑changing tokens while reusing cached features for stable high‑frequency components. E^2‑CRF achieves ~2.2 speedup while maintaining sample quality. We demonstrate effectiveness on 5 datasets. Our caching strategy naturally aligns with the diffusion process's structure‑to‑detail progression. We include sufficient‑condition error and complexity bounds under standard regularity assumptions (Appendix), alongside empirical validation. Our code is available at https://github.com/NoakLiu/FastFourierDiffusion and is also integrated in https://github.com/NoakLiu/FastCache‑xDiT.

Authors:Chao Pan, Xin Yao
Title: FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods
Abstract:
Fast Adversarial Training (FastAT) seeks to achieve adversarial robustness at a fraction of the computational cost incurred by standard multi‑step methods such as PGD‑AT. Although numerous FastAT techniques have been proposed in recent years, fair comparison among them remains elusive. Existing benchmarks and public leaderboards typically permit diverse model architectures, varying training configurations, and external data sources, making it unclear whether reported improvements reflect genuine algorithmic advances or merely more favorable experimental conditions. To address this problem, we introduce the FastAT Benchmark, a controlled evaluation framework built on three core design principles: unified architecture requirements, standardized training settings, and strict prohibition of external or synthetic data. The benchmark implements over twenty representative FastAT methods within a single codebase, enabling direct and reproducible comparison. Each method is assessed through a dual‑metric evaluation framework that measures both adversarial robustness (accuracy under PGD, AutoAttack, and CR Attack) and computational cost (GPU training time and peak memory footprint). Comprehensive experiments on CIFAR‑10, CIFAR‑100, and Tiny‑ImageNet provide reliable baseline measurements and reveal that well‑designed single‑step methods can match or surpass PGD‑AT robustness at substantially lower cost, while no single method dominates across all evaluation dimensions. The complete benchmark, including source code, configuration files, and experimental results, is publicly available to support transparent and fair evaluation of future FastAT research.

Authors:Jeremy Ellis
Title: WebSerial Vision Training for Microcontrollers: A Browser-Based Companion to On-Device CNN Training
Abstract:
This paper presents webmcu‑vision‑web, a single‑file, zero‑install browser application for end‑to‑end TinyML vision model training and deployment on the Seeed Studio XIAO ESP32‑S3 Sense (XIAO ML Kit, 15‑‑40 USD). Acting as a browser‑based companion to the on‑device Arduino firmware of Paper 1 [1], it provides a private, fully local machine learning pipeline, from firmware flashing through image collection, CNN training, weight export, and live activation visualization, without any software installation beyond a Chromium‑based browser. The system targets educators, small businesses, and researchers who need to train task‑specific visual classifiers under their exact deployment conditions. Key capabilities include: in‑browser firmware flashing via esptool‑js; an SD card file browser with image preview and inline editing; config.json live‑sync for zero‑recompile hyperparameter adjustment; webcam and ESP32 OV2640 camera image capture; TensorFlow.js CNN training completing a three‑class run (~30 images per class, 20 epochs) in approximately 1 minute browser‑side versus 9 minutes on‑device, enabling a complete collect‑train‑deploy cycle in under 10 minutes; weight export as myWeights.bin and myWeights.h; confusion matrix; and a live Conv2 activation heatmap streamed from the ESP32 during inference. No data leaves the local machine at any stage. A five‑run consistency evaluation on the three‑class reference problem (0Blank, 1Cup, 2Pen) demonstrates stable convergence with mean accuracy and standard deviation reported; all artefacts are released at the repository link below. The repository is a living template for LLM‑assisted adaptation to new hardware and tasks. All source code is MIT‑licensed at https://github.com/webmcu‑ai/webmcu‑vision‑web.

Authors:Liyao Jiang, Ruichen Chen, Keith G. Mills
Title: 2D Pre-Training for 3D Pose Estimation
Abstract:
Pre‑training is a general method that is used in a range of deep learning tasks. By first training a model on one task, and then further training on the downstream task used for final evaluation, the model is forced to learn a more general understanding of the input data. While pre‑training has been applied to 3D Human Pose Estimation (HPE) previously, the scope of datasets used is typically very limited to some strong benchmarks, like Human3.6M. Therefore, in this project, we expand the scope of an existing 3D HPE scheme to be compatible with additional 2D and 3D HPE datasets, like Occlusion Person. We perform an extensive study on how aspects of 2D pre‑training, such as model size, affect downstream performance, and to what extent pre‑training can help the model generalize to different datasets. Experimental results show that 2D pre‑training consistently outperforms training on 3D data alone, particularly in terms of computational efficiency. Finally, using MPII and Human3.6M, we are able to obtain an MPJPE score of under 64.5mm.

Authors:Rongxiao Guo, Qingchao Chen
Title: DGHMesh: A Large-scale Dual-radar mmWave Dataset and Generalization-focused Benchmark for Human Mesh Reconstruction
Abstract:
Millimeter‑wave (mmWave) radar has shown great potential for contactless, privacy‑preserving, and robust human sensing, yet existing mmWave‑based human mesh reconstruction (HMR) studies are still limited by the lack of benchmarks for generalization analysis under configuration shifts and fair comparison of different algorithms. To address the limitation, we present DGHMesh, a large‑scale dual‑radar mmWave dataset and generalization‑focused benchmark for HMR. It contains data from 15 subjects performing 8 actions, with 360,000 synchronized frames collected from FMCW radar, SFCW radar, RGB images, and high‑precision 3D HMR annotations. In addition, the dataset provides synchronized raw I/Q data from both radar modalities and accurately calibrated radar spatial positions. The benchmark is designed to evaluate HMR methods under diverse measurement configurations, including human position shifts, human orientation shifts, subarray size variations, and cross‑subject settings. Based on DGHMesh, we also propose mmPTM, a query‑based multi‑radar fusion framework that jointly exploits point clouds and imaging tubes for HMR. Extensive experiments are conducted against representative baselines under different settings. The results demonstrate that mmPTM consistently achieves outstanding accuracy and competitive generalization capability across multiple sub‑benchmarks, validating the effectiveness of multi‑radar fusion and the practical value of the proposed dataset and benchmark for mmWave‑based HMR research. DGHMesh and mmPTM are publicly available at https://github.com/SPIresearch/DGHMesh.(The complete benchmark and code will be released after paper publication)

Authors:Bayangmbe Mounmo, Sam Chien, Mile Mitrovic
Title: Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis
Abstract:
Industrial CAD workflows require robust, generalizable 3D geometric representations supporting accuracy and explainability. We introduce Shape, a self‑supervised foundation model converting surface meshes into dense per‑token embeddings. Shape combines a structured 3D latent grid, a multi‑scale geometry‑aware tokenizer (MAGNO) with cross‑attention, and a transformer processor using grouped‑query attention and RMSNorm. A learned reconstruction prior enables per‑region attribution for explainable predictions. Pretraining uses masked‑token reconstruction of normalized geometry statistics and multi‑resolution contrastive consistency. The 10.9M‑parameter backbone is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360. On a held‑out split of 2,983 meshes, Shape achieves reconstruction R2 = 0.729 and 98.1% top‑1 retrieval under the Wang‑Isola protocol, with near‑zero reconstruction train/val gap (contrastive scores use a larger evaluation pool). A 2x2 ablation on loss type and target‑space normalization shows per‑dimension normalization is critical: without it, performance collapses (R2 < 0.14, top‑1 < 88%); with it, both losses succeed (R2 > 0.70, top‑1 > 96%). Smooth‑L1 offers secondary stability. Code, embeddings, and an interactive demo are released at https://github.com/simd‑ai/shape.

Authors:Ramit Pahwa, Apoorva Beedu, Parivesh Priye, Rutu Gandhi, Saloni Takawale, Aruna Baijal, Zengli Yang
Title: Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Abstract:
Voice assistants increasingly rely on Speech Language Models (SpeechLMs) to interpret spoken queries and execute complex tasks, yet existing benchmarks lack domain breadth, acoustic diversity, and compositional reasoning complexity to evaluate tool‑calling performance. We introduce Audio2Tool, a large‑scale dataset comprising approximately 30,000 queries designed to assess tool‑calling capabilities of SpeechLMs across three primary domains: Smart Car, Smart Home, and Wearables. Our benchmark features a multi‑tier complexity hierarchy, ranging from simple direct commands to complex multi‑intent and needle‑in‑a‑haystack extraction to isolate distinct failure modes. To ensure realism, we employ zero‑shot voice cloning text‑to‑speech synthesis and diverse noise profiles to simulate in‑the‑wild conditions. Evaluations of state‑of‑the‑art SpeechLMs and ASR‑LLM pipelines show strong performance on simple commands but significant degradation under compositional and acoustic challenges. Code and dataset are publicly available on the project page: https://audio2tool.github.io/.

Authors:Archit Thorat
Title: AutoCompress: Critical Layer Isolation for Efficient Transformer Compression
Abstract:
We present AutoCompress, a transformer compression method motivated by an empirical finding: in small transformers, Layer 0 carries disproportionately high task‑critical information, with an NTK‑based importance score of 3.6 compared to a maximum of 0.054 for all other layers ‑‑ a gap of over 60x. Based on this finding, we propose Critical Layer Isolation (CLI), an architecture that protects Layer 0 at full dimensionality, compresses all intermediate layers through a learned bottleneck, and restores the full dimension at the final layer. Applied to GPT‑2 Medium (354.8M parameters), CLI‑GPT2 achieves 204.5 perplexity on WikiText‑103 with only 143.8M parameters ‑‑ a 2.47x compression ratio and 59.5% parameter reduction. Crucially, an ablation study demonstrates that a uniform bottleneck baseline of comparable size achieves only 571.8 perplexity under identical training conditions, confirming that the architectural decision to protect Layer 0 ‑‑ rather than simply reducing model size ‑‑ is the primary driver of performance. Code and checkpoints are publicly available.

Authors:Sijie Li, Shanda Li, Haowei Lin, Weiwei Sun, Ameet Talwalkar, Yiming Yang
Title: Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
Abstract:
Scaling laws are used to plan multi‑million‑dollar training runs, but fitting those laws can itself cost millions. In modern large‑scale workflows, assembling a sufficiently informative set of pilot experiments is already a major budget‑allocation problem rather than a routine preprocessing step. We formulate scaling‑law fitting as budget‑aware sequential experimental design: given a finite pool of runnable experiments with heterogeneous costs, choose which runs to execute so as to maximize extrapolation accuracy in a high‑cost target region. We then propose an uncertainty‑aware method for sequentially allocating experimental budget toward the runs most useful for target‑region extrapolation. Across a diverse benchmark of scaling‑law tasks, our method consistently outperforms classical design‑based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total training budget. Our code is available at https://github.com/PlanarG/active‑sl.

Authors:Jose Geraldo Fernandes, Luiz Facury, Pedro Robles Dutenhefner, Wagner Meira
Title: Beyond Patient Invariance: Learning Cardiac Dynamics via Action-Conditioned JEPAs
Abstract:
Self‑supervised learning in healthcare has largely relied on invariance‑based objectives, which maximize similarity between different views of the same patient. While effective for static anatomy, this paradigm is fundamentally misaligned with clinical diagnosis, as it mathematically compels the model to suppress the transient pathological changes it is intended to detect. We propose a shift towards Action‑Conditioned World Models that learn to simulate the dynamics of disease progression, or Event‑Conditioned. Adapting the LeJEPA framework to physiological time‑series, we define pathology not as a static label, but as a transition vector acting on a patient's latent state. By predicting the future electrophysiological state of the heart given a disease onset, our model explicitly disentangles stable anatomical features from dynamic pathological forces. Evaluated on the MIMIC‑IV‑ECG dataset, our approach outperforms fully supervised baselines on the critical triage task. Crucially, we demonstrate superior sample efficiency: in low‑resource regimes, our world model outperforms supervised learning by over 0.05 AUROC. These results suggest that modeling biological dynamics provides a dense supervision signal that is far more robust than static classification. Source code is available at https://github.com/cljosegfer/lesaude‑dynamics

Authors:Isaac Tosin Adisa
Title: An Integrated Framework for Explainable, Fair, and Observable Hospital Readmission Prediction: Development and Validation on MIMIC-IV
Abstract:
Objective: To propose and retrospectively validate an integrated framework addressing three barriers to clinical translation of readmission prediction: lack of explainability, absence of deployment reliability infrastructure, and inadequate demographic fairness evaluation. Materials and Methods: We constructed a cohort of 415231 adult admissions from the MIMIC‑IV database (30‑day readmission prevalence 18.0%), split 70/15/15. Logistic regression, XGBoost, and LightGBM models were trained on 26 features. SHAP provided per‑patient explanations. Fairness was evaluated across 16 subgroups using AUC‑ROC, false negative rate (FNR), and positive predictive value (PPV). Calibration was assessed using Brier scores and calibration curves. Results: XGBoost achieved AUC‑ROC 0.696 (95% CI 0.691‑0.701), outperforming or matching the LACE baseline (AUC 0.60‑0.68). LightGBM achieved best calibration (Brier 0.146). Prior admissions were the dominant predictor. All subgroups met equity thresholds (delta AUC <= 0.05, delta FNR <= 0.10). Conclusion: This framework delivers competitive performance, clinically actionable explanations, and strong demographic equity. Code is publicly available at https://github.com/Tomisin92/readmission‑prediction.

Authors:Junjun Huang, Xiliang Lu, Xuelin Xie, Jerry Zhijian Yang
Title: Robust Fuzzy local k-plane clustering with mixture distance of hinge loss and L1 norm
Abstract:
K‑plane clustering (KPC), hyperplane clustering, and mixture regression all essentially fall within the same class of problems. This problem can be conceptualized as clustering in relatively high‑dimensional K subspaces or K linear manifolds. Traditional KPC or fuzzy KPC models demonstrate a pronounced susceptibility to outliers, as they presuppose that the projection distance between data points and the plane normal vector adheres to the L2 distance. Meanwhile, the assumption of infinitely extending clusters adversely affects clustering performance. To solve these problems, this paper proposed a new robust fuzzy local k‑plane clustering (RFLkPC) method that combines the mixture distance of hinge loss and L1 norm. The RFLkPC model assumes that each plane cluster is bounded to a finite area, which can flexibly and robustly handle plane clustering tasks with outliers or not. The corresponding model and optimization algorithms of RFLkPC were provided. Compared to other related models on this topic, a large number of experiments verify the efficiency of RFLkPC on simulated data and real data. The source code for the proposed RFLkPC method is publicly available at https://github.com/xuelin‑xie/RFLkPC.

Authors:Zhanli Wu, Fabrizio Leisen, Miguel-Angel Luque-Fernandez, F. Javier Rubio
Title: Conformalized Super Learner
Abstract:
The Super Learner (SL) is a widely used ensemble method that combines predictions from a library of learners based on their predictive performance. Interval predictions are of considerable practical interest because they allow uncertainty in predictions produced by an individual learner or an ensemble to be quantified. Several methods have been proposed for constructing interval predictions based on the SL, however, these approaches are typically justified using asymptotic arguments or rely on computationally intensive procedures such as the bootstrap. Conformal prediction (CP) is a machine learning framework for constructing prediction intervals with finite‑sample and asymptotic coverage guarantees under mild conditions. We propose coupling CP with the SL through a natural construction that mirrors the original SL framework, using individual learner weights and combining learner‑specific conformity scores via a weighted majority vote. We characterize the properties of the resulting SL‑based prediction intervals for continuous outcomes. We cover settings under exchangeability, potential violations of exchangeability, and data‑generating mechanisms exhibiting heteroscedasticity, sparsity, and other forms of distributional heterogeneity. A comprehensive simulation study shows that the conformalized SL achieves valid finite‑sample coverage with competitive performance relative to the true data‑generating mechanism. A central contribution of this work is an application to predicting creatinine levels using socio‑demographic, biometric, and laboratory measurements. This example demonstrates the benefits of an ensemble with carefully selected learners designed to capture key aspects of complex regression functions, including non‑linear effects, interactions, sparsity, heteroscedasticity, and robustness to outliers.R

Authors:Kang Liu, Jianchen Hu
Title: SOC-ICNN: From Polyhedral to Conic Geometry for Learning Convex Surrogate Functions
Abstract:
Classical ReLU‑based Input Convex Neural Networks (ICNNs) are equivalent to the optimal value functions of Linear Programming (LP). This intrinsic structural equivalence restricts their representational capacity to piecewise‑linear polyhedral functions. To overcome this representational bottleneck, we propose the SOC‑ICNN, an architecture that generalizes the underlying optimization class from LP to Second‑Order Cone Programming (SOCP). By explicitly injecting positive semi‑definite curvature and Euclidean norm‑based conic primitives, our formulation introduces native smooth curvature into the representation while preserving a rigorous optimization‑theoretic interpretation. We formally prove that SOC‑ICNNs strictly expand the representational space of ReLU‑ICNNs without increasing the asymptotic order of forward‑pass complexity. Extensive experiments demonstrate that SOC‑ICNN substantially improves function approximation, while delivering competitive downstream decision quality. The code is available at https://github.com/Kanyooo/SOC‑ICNN.

Authors:Chang Sun, Zhiqiang Que, Bakhtiar Zadeh, Qibin Liu, Kevin H. Alvarez, Wayne Luk, Maria Spiropulu
Title: HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference
Abstract:
Lookup‑table (LUT) based neural networks can deliver ultra‑low latency and excellent hardware efficiency on FPGAs by mapping arithmetic operations directly onto the logic primitives. However, state‑of‑the‑art LUT‑aware training (LAT) approaches remain difficult to use in practice: they are often orders of magnitude slower to train than conventional networks, require non‑trivial manual tuning for hardware efficiency, and lack an end‑to‑end workflow. This work presents HGQ‑LUT, integrated in https://github.com/calad0i/HGQ2, a new LAT approach that achieves state‑of‑the‑art hardware efficiency while accelerating training by over 100 times on modern GPUs. HGQ‑LUT introduces LUT‑Dense and LUT‑Conv layers that are implemented with regular, accelerator‑efficient tensor operations during training, which are then compiled into logic LUTs for hardware. By combining these layers with fine‑grained, element‑wise heterogeneous quantization (including zero‑bit pruning) and a LUT‑aware resource surrogate, HGQ‑LUT enables the automatic exploration of accuracy‑resource trade‑offs without manual bit‑width tuning. We further integrate HGQ‑LUT into open‑source toolchains, enabling unified design, compilation, and bit‑exact verification of hybrid architectures that mix LUT‑based with conventional arithmetic blocks. These features make LAT‑based DNNs practical for real‑world deployment, such as at the CERN Large Hadron Collider's experiments.

Authors:Zhancun Mu, Guangyu Zhao, Yiwu Zhong, Chi Zhang
Title: Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
Abstract:
One‑step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one‑step extraction pipelines, a strong iterative teacher provides one target action for each latent draw, and the same student output is asked to do both jobs: move toward higher Q and stay near that paired endpoint. If those two directions disagree, the loss resolves them as a compromise on that same sample, even when a nearby better action remains locally supported by the data. We propose DROL, a latent‑conditioned one‑step actor trained with top‑1 dynamic routing. For each state, the actor samples K candidate actions from a bounded latent prior, assigns each dataset action to its nearest candidate, and updates only that winner with Behavior Cloning and critic guidance. Because the routing is recomputed from the current candidate geometry, ownership of a supported region can shift across candidates over the course of learning. This gives a one‑step actor room to make local improvements that pointwise extraction struggles to capture, while retaining single‑pass inference at test time. On OGBench and D4RL, DROL is competitive with the one‑step FQL baseline, improving many OGBench task groups while remaining strong on both AntMaze and Adroit. Project page: https://muzhancun.github.io/preprints/DROL.

Authors:Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He
Title: Estimating Tail Risks in Language Model Output Distributions
Abstract:
Language models are increasingly capable and are being rapidly deployed on a population‑level scale. As a result, the safety of these models is increasingly high‑stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst‑case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute‑force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample‑efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute‑force Monte Carlo estimates using 10‑20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^‑4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare‑event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk

Authors:Weiqiu You, Cassandra Goldberg, Amin Madani, Daniel A. Hashimoto, Eric Wong
Title: Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
Abstract:
Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and mortality. While large vision‑language models (LVLMs) offer flexible reasoning, their predictions remain difficult to audit and unreliable on safety‑critical surgical tasks. Methods: We introduce Sum‑of‑Checks, a framework that decomposes each CVS criterion into expert‑defined reasoning checks reflecting clinically relevant visual evidence. Given a laparoscopic frame, an LVLM evaluates each check, producing a binary judgment and justification. Criterion‑level scores are computed via fixed, weighted aggregation of check outcomes. We evaluate on the Endoscapes2023 benchmark using three frontier LVLMs, comparing against direct prompting, chain‑of‑thought, and sub‑question decomposition, each with and without few‑shot examples. Results: Sum‑of‑Checks improves average frame‑level mean average precision by 12‑‑14% relative to the best baseline across all three models and criteria. Analysis of individual checks reveals that LVLMs are reliable on observational checks (e.g., visibility, tool obstruction) but show substantial variability on decision‑critical anatomical evidence. Conclusion: Structuring surgical reasoning into expert‑aligned verification checks improves both accuracy and transparency of LVLM‑based CVS assessment, demonstrating that explicitly separating evidence elicitation from decision‑making is critical for reliable and auditable surgical AI systems. Code is available at https://github.com/BrachioLab/SumOfChecks.

Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Title: Where Should LoRA Go? Component-Type Placement in Hybrid Language Models
Abstract:
Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component‑type LoRA placement across two hybrid architectures ‑‑ Qwen3.5‑0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon‑H1‑0.5B (parallel, Mamba‑2 SSM + attention) ‑‑ fine‑tuned on three domains and evaluated on five benchmarks. We find that the attention pathway ‑‑ despite being the minority component ‑‑ consistently outperforms full‑model adaptation with 5‑10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (‑14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross‑task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component‑aware LoRA placement is a necessary design dimension for hybrid architectures.

Authors:Zhaohui Wang
Title: Who Audits the Auditor? Tamper-Proof Fraud Detection with Blockchain-Anchored Explainable ML
Abstract:
In enterprise fraud detection, model accuracy alone is insufficient when insiders can tamper with audit logs or bypass approval workflows. Real‑world incidents show that fraud often persists not because detection algorithms fail, but because the audit trail itself is controllable by privileged operators. This exposes a fundamental trust gap: who audits the auditor? We present a tamper‑evident fraud detection system that anchors both ML predictions and workflow execution to an immutable blockchain ledger. Rather than using blockchain as passive storage, we enforce the entire approval process through smart contracts, ensuring that every transaction, prediction, and explanation is atomically recorded and cannot be retroactively modified. Our detection module achieves competitive accuracy (F1 = 0.895, PR‑AUC = 0.974) while providing cryptographically verifiable decision trails that support regulatory auditability requirements (e.g., GDPR Article 22). System evaluation shows sub‑25 ms inference latency and economically viable deployment on Layer‑2 networks at under \0.01 per transaction (validated against PolygonScan data), supporting enterprise‑scale workloads of 10,000+ monthly payments.

Authors:Deepank Girish, Yi Hao Chan, Sukrit Gupta, Jing Xia, Jagath C. Rajapakse
Title: Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity
Abstract:
Several brain foundation models (FM) have recently been proposed to predict brain disorders by modelling dynamic functional connectivity (FC). While they demonstrate remarkable model performance and zero‑ or few‑shot generalization, the salient features identified as potential biomarkers are yet to be thoroughly evaluated. We propose RE‑CONFIRM, a framework for evaluating the robustness of potential biomarker candidates elucidated by deep learning (DL) models including FMs. From experiments on five large datasets of Autism Spectrum Disorder (ASD), Attention‑deficit Hyperactivity Disorder (ADHD), and Alzheimer's Disease (AD), we found that although commonly used performance metrics provide an intuitive assessment of model predictions, they are insufficient for evaluating the robustness of biomarkers identified by these models. RE‑CONFIRM metrics revealed that simply finetuning FMs leads to models that fail to capture regional hubs effectively, even in disorders where hubs are known to be implicated, such as ASD and ADHD. In view of this, we propose Hub‑LoRA (Low‑Rank Adaptation) as a fine‑tuning technique that enables FMs to not only outperform customised DL models but also produce neurobiologically faithful biomarkers supported by meta‑analyses. RE‑CONFIRM is generalizable and can be easily applied to ascertain the robustness of DL models trained on functional MRI datasets. Code is available at: https://github.com/SCSE‑Biomedical‑Computing‑Group/RE‑CONFIRM.

Authors:Grigory Sapunov
Title: Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
Abstract:
We study learned memory tokens as a computational scratchpad for a single‑block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku‑Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non‑trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8‑32, 57.4% +/‑ 0.7% exact‑match) and a dilution boundary at T=64. Under halt‑side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero‑bias and Graves' recommended positive bias settle into a shallow halt equilibrium the model cannot escape. Inverting the bias to ‑3 ("deep start") eliminates the failure mode, and ablation shows the trap is inherent to ACT initialization rather than an artifact of our architecture. With reliable training, ACT yields an order of magnitude lower seed variance than fixed‑depth processing (+/‑0.7 vs +/‑9.3 pp); lambda warmup recovers 34% of compute at matched accuracy; and attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code: https://github.com/che‑shr‑cat/utm‑jax.

Authors:Charles Junichi McAndrews
Title: Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation
Abstract:
Small language models (1‑3B) are practical to run locally, but individually limited on harder code generation tasks. We ask whether composing them into pipelines can recover some of that lost capability. We study code generation pipelines built from 1‑3B models with execution feedback, and use a NEAT‑inspired evolutionary search to test whether more complex pipeline structure helps beyond a simple refinement loop. We evaluate on HumanEval (164 problems) and sanitized MBPP (427 problems), all with local inference on a single laptop. Self‑refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks. The gains are narrow in mechanism: refinement fixes many runtime errors (especially NameError and SyntaxError), but rarely fixes logic errors such as AssertionError. Within our tested general‑purpose model pool, generator identity mattered less than refiner capability: a 1.5B generator paired with a 3B refiner matched a 3B model doing both roles. Early stopping is essential; without it, every iteration is net‑negative. The code‑specialized models outperform every general‑purpose pipeline configuration, suggesting model specialization matters more than pipeline architecture. Preliminary text‑only pipeline experiments without execution feedback did not show gains at this scale. In our constrained search space, evolutionary search mostly rediscovered the same simple generate‑execute‑refine loop we found manually, with no clearly significant gain from added topology. Single‑evaluation fitness inflates results by 5‑7 percent, selecting lucky genomes over good ones. On these benchmarks at 1‑3B scale, execution feedback mattered more than added pipeline complexity in determining whether composition helped.

Authors:Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
Title: When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
Abstract:
Despite impressive progress in capabilities of large vision‑language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, We propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL‑DPO, a framework for fine‑tuning off‑the‑shelf LVLMs towards more visually grounded responses. HalluVL‑DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah‑kh.github.io/projects/prompts‑override‑vision/ .

Authors:Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng
Title: StructMem: Structured Memory for Long-Horizon Behavior in LLMs
Abstract:
Long‑term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi‑hop question answering. Current approaches face a fundamental trade‑off: flat memory is efficient but fails to model relational structure, while graph‑based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure‑enriched hierarchical memory framework that preserves event‑level bindings and induces cross‑event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi‑hop performance on \textttLoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems, see https://github.com/zjunlp/LightMem .

Authors:Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
Title: Building a Precise Video Language with Human-AI Oversight
Abstract:
Video‑language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high‑quality captions, we introduce CHAI (Critique‑based Human‑AI Oversight), a framework where trained experts critique and revise model‑generated pre‑captions into improved post‑captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre‑ and post‑captions provide rich supervision for improving open‑source models (Qwen3‑VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference‑time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed‑source models such as Gemini‑3.1‑Pro. Finally, we apply our approach to re‑caption large‑scale professional videos (e.g., films, commercials, games) and fine‑tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human‑AI oversight are key to professional‑level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/

Authors:Sukesh Subaharan
Title: Dynamical Priors as a Training Objective in Reinforcement Learning
Abstract:
Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP‑RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning. Across three minimal environments, we show that dynamical priors systematically alter decision trajectories in task‑dependent ways, promoting temporally structured behavior that cannot be explained by generic smoothing. These results demonstrate that training objectives alone can control the temporal geometry of decision‑making in RL agents.

Authors:Yixuan Zhu, Shilin Ma, Haolin Wang, Ao Li, Yanzhe Jing, Yansong Tang, Lei Chen, Jiwen Lu, Jie Zhou
Title: VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution
Abstract:
Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real‑world image super‑resolution (Real‑ISR). However, adapting VAR for ISR presents critical challenges. The next‑scale prediction mechanism, constrained by causal attention, fails to fully exploit global low‑quality (LQ) context, resulting in blurry and inconsistent high‑quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre‑trained text‑to‑image VAR model into a one‑step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross‑scale attention, which enables bidirectional scale‑wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine‑tuning only 1.2% of the model parameters through parameter‑efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state‑of‑the‑art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.

Authors:Wadii Boulila, Adel Ammar, Bilel Benjdira, Maha Driss
Title: Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning
Abstract:
Self‑supervised learning (SSL) is a standard approach for representation learning in aerial imagery. Existing methods enforce invariance between augmented views, which works well when augmentations preserve semantic content. However, aerial images are frequently degraded by haze, motion blur, rain, and occlusion that remove critical evidence. Enforcing alignment between a clean and a severely degraded view can introduce spurious structure into the latent space. This study proposes a training strategy and architectural modification to enhance SSL robustness to such corruptions. It introduces a per‑sample, per‑factor trust weight into the alignment objective, combined with the base contrastive loss as an additive residual. A stop‑gradient is applied to the trust weight instead of a multiplicative gate. While a multiplicative gate is a natural choice, experiments show it impairs the backbone, whereas our additive‑residual approach improves it. Using a 200‑epoch protocol on a 210,000‑image corpus, the method achieves the highest mean linear‑probe accuracy among six backbones on EuroSAT, AID, and NWPU‑RESISC45 (90.20% compared to 88.46% for SimCLR and 89.82% for VICReg). It yields the largest improvements under severe information‑erasing corruptions on EuroSAT (+19.9 points on haze at s=5 over SimCLR). The method also demonstrates consistent gains of +1 to +3 points in Mahalanobis AUROC on a zero‑shot cross‑domain stress test using BDD100K weather splits. Two ablations (scalar uncertainty and cosine gate) indicate the additive‑residual formulation is the primary source of these improvements. An evidential variant using Dempster‑Shafer fusion introduces interpretable signals of conflict and ignorance. These findings offer a concrete design principle for uncertainty‑aware SSL. Code is publicly available at https://github.com/WadiiBoulila/trust‑ssl.

Authors:Yongcan Yu, Lingxiao He, Jian Liang, Kuangpu Guo, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He
Title: Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
Abstract:
Test‑time reinforcement learning (TTRL) always adapts models at inference time via pseudo‑labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can be even amplified through group‑relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test‑time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency‑based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group‑relative policy optimization. Finally, DDRL incorporates a consensus‑based off‑policy refinement stage, which leverages the rejection‑sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.

Authors:Hieu Man, Van-Cuong Pham, Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen
Title: Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI
Abstract:
Learning robust representations of authorial style is crucial for authorship attribution and AI‑generated text detection. However, existing methods often struggle with content‑style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation‑by‑design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VEA) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates natural language explanation for their decision, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state‑of‑the‑art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI‑generated text detection, EAVAE excels in few‑shot learning over the M4 dataset. Code and data repositories are available online\footnotehttps://github.com/hieum98/avae \footnotehttps://huggingface.co/collections/Hieuman/document‑level‑authorship‑datasets.

Authors:Jon-Paul Cacioli
Title: Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding
Abstract:
Cacioli (2026) showed that the K‑way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log‑softmax margin. The reduction rests on five assumptions, including cross‑entropy (CE) at the output and effectively feedforward inference dynamics. This pre‑registered study tests the reduction's sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang & Bogacz, 2025). Across 10 seeds on CIFAR‑10 with a matched 2.1M‑parameter backbone, we find three results. The negative result replicates on standard PC: the probe sits below softmax (Delta = ‑0.082, p < 10^‑6). On bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre‑registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Removing CE alone without changing inference dynamics halves the probe‑softmax gap (Delta_MSE = ‑0.037 vs Delta_stdPC = ‑0.082). CE is a major empirically load‑bearing component of the decomposition at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post‑hoc temperature scaling ablation decomposes the probe‑softmax gap into two components: approximately 66% is attributable to logit‑scale effects removable by temperature rescaling, and approximately 34% reflects a scale‑invariant ranking advantage of CE‑trained representations. We use "metacognitive" operationally to denote Type‑2 discrimination of a readout over its own Type‑1 correctness, not to imply human‑like introspective access.

Authors:Lars van der Laan, Mark Van Der Laan
Title: Calibeating Prediction-Powered Inference
Abstract:
We study semisupervised mean estimation with a small labeled sample, a large unlabeled sample, and a black‑box prediction model whose output may be miscalibrated. A standard approach in this setting is augmented inverse‑probability weighting (AIPW) [Robins et al., 1994], which protects against prediction‑model misspecification but can be inefficient when the prediction score is poorly aligned with the outcome scale. We introduce Calibrated Prediction‑Powered Inference, which post‑hoc calibrates the prediction score on the labeled sample before using it for semisupervised estimation. This simple step requires no retraining and can improve the original score both as a predictor of the outcome and as a regression adjustment for semisupervised inference. We study both linear and isotonic calibration. For isotonic calibration, we establish first‑order optimality guarantees: isotonic post‑processing can improve predictive accuracy and estimator efficiency relative to the original score and simpler post‑processing rules, while no further post‑processing of the fitted isotonic score yields additional first‑order gains. For linear calibration, we show first‑order equivalence to PPI++. We also clarify the relationship among existing estimators, showing that the original PPI estimator is a special case of AIPW and can be inefficient when the prediction model is accurate, while PPI++ is AIPW with empirical efficiency maximization [Rubin et al., 2008]. In simulations and real‑data experiments, our calibrated estimators often outperform PPI and are competitive with, or outperform, AIPW and PPI++. We provide an accompanying Python package, ppi_aipw, at https://larsvanderlaan.github.io/ppi‑aipw/.

Authors:Guilin Deng, Silong Chen, Yuchuan Luo, Yi Liu, Songlei Wang, Zhiping Cai, Lin Liu, Xiaohua Jia, Shaojing Fu
Title: Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
Abstract:
Federated Large Language Models (FedLLMs) enable multiple parties to collaboratively fine‑tune LLMs without sharing raw data, addressing challenges of limited resources and privacy concerns. Despite data localization, shared gradients can still expose sensitive information through membership inference attacks (MIAs). However, FedLLMs' unique properties, i.e. massive parameter scales, rapid convergence, and sparse, non‑orthogonal gradients, render existing MIAs ineffective. To address this gap, we propose ProjRes, the first projection residuals‑based passive MIA tailored for FedLLMs. ProjRes leverages hidden embedding vectors as sample representations and analyzes their projection residuals on the gradient subspace to uncover the intrinsic link between gradients and inputs. It requires no shadow models, auxiliary classifiers, or historical updates, ensuring efficiency and robustness. Experiments on four benchmarks and four LLMs show that ProjRes achieves near 100% accuracy, outperforming prior methods by up to 75.75%, and remains effective even under strong differential privacy defenses. Our findings reveal a previously overlooked privacy vulnerability in FedLLMs and call for a re‑examination of their security assumptions. Our code and data are available at \hrefhttps://anonymous.4open.science/r/Passive‑MIA‑5268link.

Authors:Amandeep Kaur, Mirali Purohit, Gedeon Muhawenayo, Esther Rolf, Hannah Kerner
Title: Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance
Abstract:
New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per‑continent pretraining datasets and evaluated them on global and per‑continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent‑specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high‑performing pretraining dataset. We open‑sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner‑lab/pretrain‑where.

Authors:Anurita Das
Title: MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
Abstract:
Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load‑time per‑layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per‑layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5‑1.8x higher decode throughput than llama‑cpp Q4_0 on NVIDIA T4 and enables models to run in memory regimes previously infeasible without modifying weights.

Authors:Chao Pan, Yu Wu, Xin Yao
Title: SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
Abstract:
Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content with safety failure rates exceeding 95%. Existing input‑level defenses achieve a 100% failure rate against ISC, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system‑level override that defeats ISC by redirecting the model's task‑completion drive rather than suppressing it. SafeRedirect grants explicit permission to fail the task, prescribes a deterministic hard‑stop output, and instructs the model to preserve harmful placeholders unresolved. Evaluated on seven frontier LLMs across three AI/ML‑related ISC task types in the single‑turn setting, SafeRedirect reduces average unsafe generation rates from 71.2% to 8.0%, compared to 55.0% for the strongest viable baseline. Multi‑model ablation reveals that failure permission and condition specificity are universally critical, while the importance of other components varies across models. Cross‑attack evaluation confirms state‑of‑the‑art defense against ISC with generalization performance at least on par with the baseline on other attack families. Code is available at https://github.com/fzjcdt/SafeRedirect.

Authors:Yuzhen Mao, Michael Y. Li, Emily B. Fox
Title: Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
Abstract:
Scaling large language models to long contexts is challenging due to the quadratic computational cost of full attention. Mitigation approaches include KV‑cache selection or compression techniques. We instead provide an effective and end‑to‑end learnable bridge between the two without requiring architecture modification. In particular, our key insight is that interleaved gist compression tokens ‑‑ which provide a learnable summary of sets of raw tokens ‑‑ can serve as routing signals for sparse attention. Building on this, we introduce selective unfolding via GSA, which first compresses the context into gist tokens, then selects the most relevant gists, and subsequently restores the corresponding raw chunks for detailed attention. This yields a simple coarse‑to‑fine mechanism that combines compact global representations with targeted access to fine‑grained evidence. We further incorporate this process directly into training in an end‑to‑end fashion, avoiding the need for external retrieval modules. In addition, we extend the framework hierarchically via recursive gist‑of‑gist construction, enabling multi‑resolution context access with logarithmic per‑step decoding complexity. Empirical results on LongBench and RAG benchmarks demonstrate that our method consistently outperforms other compression baselines as well as inference‑time sparse attention methods across compression ratios from 8× to 32×. The code is available at: https://github.com/yuzhenmao/gist‑sparse‑attention/

Authors:Sina Gholami, Abdulmoneam Ali, Tania Haghighi, Ahmed Arafa, Minhaj Nur Alam
Title: FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels
Abstract:
Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi‑stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise‑tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class‑wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise‑aware training strategy that integrates logit‑adjusted loss, knowledge distillation, and distance‑aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state‑of‑the‑art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.

Authors:Shelly Golan, Michael Finkelson, Ariel Bereslavsky, Yotam Nitzan, Or Patashnik
Title: ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
Abstract:
Reinforcement Learning (RL) post‑training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of ``early scalarization'' collapses rewards into a fixed weighted sum. This commits the model to a single trade‑off point at training time, providing no inference‑time control over inherently conflicting goals ‑‑ such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi‑objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade‑offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state‑of‑the‑art flow‑matching backbones: SD3.5, FluxKontext, and LTX‑2. Our single preference‑conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade‑offs, while uniquely providing fine‑grained control over competing generative goals.

Authors:Dimitrije Antić, Alvaro Budria, George Paschalidis, Sai Kumar Dwivedi, Dimitrios Tzionas
Title: LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image
Abstract:
Reconstructing 3D Human‑Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill‑posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ‑VAE. We then develop LEXIS‑Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields help in a guided refinement that ensures physically‑plausible, proximity‑aware reconstructions without requiring post‑hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS‑Flow significantly outperforms existing SotA baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding. Code & models will be public at https://anticdimi.github.io/lexis.

Authors:Grant Molnar
Title: A weighted angle distance on strings
Abstract:
We define a multi‑scale metric d_ρ on strings by aggregating angle distances between all n‑gram count vectors with exponential weights ρ^n. We benchmark d_ρ in DBSCAN clustering against edit and n‑gram baselines, give a linear‑time suffix‑tree algorithm for evaluation, prove metric and stability properties (including robustness under tandem‑repeat stutters), and characterize isometries.

Authors:Aravind Venugopal, Jiayu Chen, Xudong Wu, Chongyi Zheng, Benjamin Eysenbach, Jeff Schneider
Title: Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning
Abstract:
The temporal lag between actions and their long‑term consequences makes credit assignment a challenge when learning goal‑directed behaviors from data. Generative world models capture the distribution of future states an agent may visit, indicating that they have captured temporal information. How can that temporal information be extracted to perform credit assignment? In this paper, we formalize how the temporal information stored in world models encodes the underlying geometry of the world. Leveraging optimal transport, we extract this geometry from a learned model of the occupancy measure into a reward function that captures goal‑reaching information. Our resulting method, Occupancy Reward Shaping, largely mitigates the problem of credit assignment in sparse reward settings. ORS provably does not alter the optimal policy, yet empirically improves performance by 2.2x across 13 diverse long‑horizon locomotion and manipulation tasks. Moreover, we demonstrate the effectiveness of ORS in the real world for controlling nuclear fusion on 3 Tokamak control tasks. Code: https://github.com/aravindvenu7/occupancy_reward_shaping; Website: https://aravindvenu7.github.io/website/ors/

Authors:Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim
Title: DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories
Abstract:
Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human‑verified benchmark built from natural human dialogue using a multiple‑choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting ‑‑ probing whether models can identify state‑consistent dialogue trajectories solely from mental‑state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM‑generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth‑py/DialToM.

Authors:Huaiyu Jia, Jiehshun You, Yizhi Luo, Jingyu Liu, Shuo Sun
Title: Towards Event-Aware Forecasting in DeFi: Insights from On-chain Automated Market Maker Protocols
Abstract:
Automated Market Makers (AMMs), as a core infrastructure of decentralized finance (DeFi), uniquely drive on‑chain asset pricing through a deterministic reserve ratio mechanism. Unlike traditional markets, AMM price dynamics is triggered largely by on‑chain events (e.g., swap) that change the reserve ratio, rather than by continuous responses to off‑chain information. This makes event‑level analysis crucial for understanding price formation mechanisms in AMMs. However, existing research generally neglects the micro‑structural dynamics at the AMMs level, lacking both a comprehensive dataset covering multiple protocols with fine‑grained event classification and an effective framework for event‑aware modeling. To fill this gap, we construct a dataset containing 8.9 million on‑chain event records from four representative AMMs protocols: Pendle, Uniswap v3, Aave and Morpho, with precise annotations of transaction type and block height timestamps. Furthermore, we propose an Uncertainty Weighted Mean Squared Error (UWM) loss function, which incorporates the block interval regression term into the traditional Time‑Point Process (TPP) objective function by weighting the uncertainty with homoscedasticity. Extensive experiments on eight advanced TPP architectures demonstrate that this loss function reduces the time prediction error by an average of 56.41% while maintaining the accuracy of event type prediction, establishing a robust benchmark for event‑aware prediction in the AMMs ecosystem. This work provides the necessary data foundation and methodological framework for modeling the discreteness and event‑driven characteristics of on‑chain price discovery. All datasets and source code are publicly available. https://github.com/yosen‑king/Deep‑AMM‑Events

Authors:Zhenyu Wang, Geyan Ye, Wei Liu, Man Tat Alexander Ng
Title: AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling
Abstract:
Virtual cell modeling predicts molecular state changes under genetic perturbations in silico, which is essential for biological mechanism studies. However, existing approaches suffer from unconstrained reasoning, uninterpretable predictions, and retrieval signals that are weakly aligned with regulatory topology. To address these limitations, we propose AROMA, an Augmented Reasoning Over a Multimodal Architecture for virtual cell genetic perturbation modeling. AROMA integrates textual evidence, graph‑topology information, and protein sequence features to model perturbation‑target dependencies, and is trained with a two‑stage optimization strategy to yield predictions that are both accurate and interpretable. We also construct two knowledge graphs and a perturbation reasoning dataset, PerturbReason, containing more than 498k samples, as reusable resources for the virtual cell domain. Experiments show that AROMA outperforms existing methods across multiple cell lines, and remains robust under zero‑shot evaluation on an unseen cell line, as well as in knowledge‑sparse, long‑tail scenarios. Overall, AROMA demonstrates that combining knowledge‑driven multimodal modeling with evidence retrieval provides a promising pathway toward more reliable and interpretable virtual cell perturbation prediction. Model weights are available at https://huggingface.co/blazerye/AROMA. Code is available at https://github.com/blazerye/AROMA.

Authors:Wengyu Zhang, Xiao-Yong Wei, Qing Li
Title: Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design
Abstract:
Text‑guided molecular design is a key capability for AI‑driven drug discovery, yet it remains challenging to map sequential natural‑language instructions with non‑linear molecular structures under strict chemical constraints. Most existing approaches, including RAG, CoT prompting, and fine‑tuning or RL, emphasize a small set of ad‑hoc reasoning perspectives implemented in a largely one‑shot generation pipeline. In contrast, real‑world drug discovery relies on dynamic, multi‑perspective critique and iterative refinement to reconcile semantic intent with structural feasibility. Motivated by this, we propose Mol‑Debate, a generation paradigm that enables such dynamic reasoning through an iterative generate‑debate‑refine loop. We further characterize key challenges in this paradigm and address them through perspective‑oriented orchestration, including developer‑debater conflict, global‑local structural reasoning, and static‑dynamic integration. Experiments demonstrate that Mol‑Debate achieves state‑of‑the‑art performance against strong general and chemical baselines, reaching 59.82% exact match on ChEBI‑20 and 50.52% weighted success rate on S^2‑Bench. Our code is available at https://github.com/wyuzh/Mol‑Debate.

Authors:Rongtao Zhang, Xin Zhu, Masoume Pourebadi Khotbehsara, Warren Dao, Erdem Bıyık, Heather Culbertson
Title: Vibrotactile Preference Learning: Uncertainty-Aware Preference Learning for Personalized Vibration Feedback
Abstract:
Individual differences in vibrotactile perception underscore the growing importance of personalization as haptic feedback becomes more prevalent in interactive systems. We propose Vibrotactile Preference Learning (VPL), a system that captures user‑specific preference spaces over vibrotactile parameters via Gaussian‑process‑based uncertainty‑aware preference learning. VPL uses an expected information gain‑based acquisition strategy to guide query selection over 40 rounds of pairwise comparisons of overall user preference, augmented with user‑reported uncertainty, enabling efficient exploration of the parameter space. We evaluate VPL in a user study (N = 13) using the vibrotactile feedback from a Microsoft Xbox controller, showing that it efficiently learns individualized preferences while maintaining comfortable, low‑workload user interactions. These results highlight the potential of VPL for scalable personalization of vibrotactile experiences.

Authors:Mobin Habibpour, Niloufar Alipour Talemi, John Spodnik, Camren J. Khoury, Fatemeh Afghah
Title: WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring
Abstract:
Wildfire monitoring requires timely, actionable situational awareness from airborne platforms, yet existing aerial visual question answering (VQA) benchmarks do not evaluate wildfire‑specific multimodal reasoning grounded in thermal measurements. We introduce WildFireVQA, a large‑scale VQA benchmark for aerial wildfire monitoring that integrates RGB imagery with radiometric thermal data. WildFireVQA contains 6,097 RGB‑thermal samples, where each sample includes an RGB image, a color‑mapped thermal visualization, and a radiometric thermal TIFF, and is paired with 34 questions, yielding a total of 207,298 multiple‑choice questions spanning presence and detection, classification, distribution and segmentation, localization and direction, cross‑modal reasoning, and flight planning for operational wildfire intelligence. To improve annotation reliability, we combine multimodal large language model (MLLM)‑based answer generation with sensor‑driven deterministic labeling, manual verification, and intra‑frame and inter‑frame consistency checks. We further establish a comprehensive evaluation protocol for representative MLLMs under RGB, Thermal, and retrieval‑augmented settings using radiometric thermal statistics. Experiments show that across task categories, RGB remains the strongest modality for current models, while retrieved thermal context yields gains for stronger MLLMs, highlighting both the value of temperature‑grounded reasoning and the limitations of existing MLLMs in safety‑critical wildfire scenarios. The dataset and benchmark code are open‑source at https://github.com/mobiiin/WildFire_VQA.

Authors:Boxin Zhao, Mladen Kolar, Jinchi Lv
Title: SMART: A Spectral Transfer Approach to Multi-Task Learning
Abstract:
Multi‑task learning is effective for related applications, but its performance can deteriorate when the target sample size is small. Transfer learning can borrow strength from related studies; yet, many existing methods rely on restrictive bounded‑difference assumptions between the source and target models. We propose SMART, a spectral transfer method for multi‑task linear regression that instead assumes spectral similarity: the target left and right singular subspaces lie within the corresponding source subspaces and are sparsely aligned with the source singular bases. Such an assumption is natural when studies share latent structures and enables transfer beyond the bounded‑difference settings. SMART estimates the target coefficient matrix through structured regularization that incorporates spectral information from a source study. Importantly, it requires only a fitted source model rather than the raw source data, making it useful when data sharing is limited. Although the optimization problem is nonconvex, we develop a practical ADMM‑based algorithm. We establish general, non‑asymptotic error bounds and a minimax lower bound in the noiseless‑source regime. Under additional regularity conditions, these results yield near‑minimax Frobenius error rates up to logarithmic factors. Simulations confirm improved estimation accuracy and robustness to negative transfer, and analysis of multi‑modal single‑cell data demonstrates better predictive performance. The Python implementation of SMART, along with the code to reproduce all experiments in this paper, is publicly available at https://github.com/boxinz17/smart.

Authors:Xi Chen, Arian Maleki, Shirin Jalali
Title: Maximum Likelihood Reconstruction for Multi-Look Digital Holography with Markov-Modeled Speckle Correlation
Abstract:
Multi‑look acquisition is a widely used strategy for reducing speckle noise in coherent imaging systems such as digital holography. By acquiring multiple measurements, speckle can be suppressed through averaging or joint reconstruction, typically under the assumption that speckle realizations across looks are statistically independent. In practice, however, hardware constraints limit measurement diversity, leading to inter‑look correlation that degrades the performance of conventional methods. In this work, we study the reconstruction of speckle‑free reflectivity from complex‑valued multi‑look measurements in the presence of correlated speckle. We model the inter‑look dependence using a first‑order Markov process and derive the corresponding likelihood under a first‑order Markov approximation, resulting in a constrained maximum likelihood estimation problem. To solve this problem, we develop an efficient projected gradient descent framework that combines gradient‑based updates with implicit regularization via deep image priors, and leverages Monte Carlo approximation and matrix‑free operators for scalable computation. Simulation results demonstrate that the proposed approach remains robust under strong inter‑look correlation, achieving performance close to the ideal independent‑look scenario and consistently outperforming methods that ignore such dependencies. These results highlight the importance of explicitly modeling inter‑look correlation and provide a practical framework for multi‑look holographic reconstruction under realistic acquisition conditions. Our code is available at: https://github.com/Computational‑Imaging‑RU/MLE‑Holography‑Markov.

Authors:Natalia Martinez Gil, Fearghal O'Donncha, Wesley M. Gifford, Nianjun Zhou, Dhaval C. Patel, Roman Vaculin
Title: Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring
Abstract:
We propose a post‑hoc adaptive conformal anomaly detection method for monitoring time series that leverages predictions from pre‑trained foundation models without requiring additional fine‑tuning. Our method yields an interpretable anomaly score directly interpretable as a false alarm rate (p‑value), facilitating transparent and actionable decision‑making. It employs weighted quantile conformal prediction bounds and adaptively learns optimal weighting parameters from past predictions, enabling calibration under distribution shifts and stable false alarm control, while preserving out‑of‑sample guarantees. As a model‑agnostic solution, it integrates seamlessly with foundation models and supports rapid deployment in resource‑constrained environments. This approach addresses key industrial challenges such as limited data availability, lack of training expertise, and the need for immediate inference, while taking advantage of the growing accessibility of time series foundation models. Experiments on both synthetic and real‑world datasets show that the proposed approach delivers strong performance, combining simplicity, interpretability, robustness, and adaptivity.

Authors:Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo F. R. Ribeiro, Markus Dreyer, Sean Ammirati, Chenyan Xiong
Title: SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
Abstract:
Skills have become the de facto way to enable LLM agents to perform complex real‑world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill‑dependent tasks across 15 sub‑domains derived from a real‑world skill taxonomy , evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, those leveraging one‑shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no‑skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open‑ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self‑feedback alone induces recursive drift. Our data and code are open‑source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation and continual learning techniques.

Authors:Zehuan Zhang, Mark Chen, He Li, Wayne Luk
Title: Algorithm and Hardware Co-Design for Efficient Complex-Valued Uncertainty Estimation
Abstract:
Complex‑Valued Neural Networks (CVNNs) have significant advantages in handling tasks that involve complex numbers. However, existing CVNNs are unable to quantify predictive uncertainty. We propose, for the first time, dropout‑based Bayesian Complex‑Valued Neural Networks (BayesCVNNs) to enable uncertainty quantification for complex‑valued applications, exhibiting broad applicability and efficiency for hardware implementation due to modularity. Furthermore, as the dual‑part nature of complex values significantly broadens the design space and enables novel configurations based on layer‑mixing and part‑mixing, we introduce an automated search approach to effectively identify optimal configurations for both real and imaginary components. To facilitate deployment, we present a framework that generates customized FPGA‑based accelerators for BayesCVNNs, leveraging a set of optimized building blocks. Experiments demonstrate the best configuration can be effectively found via the automated search, attaining higher performance with lower hardware costs compared with manually crafted models. The optimized accelerators achieve approximately 4.5x and 13x speedups on different models with less than 10% power consumption compared to GPU implementations, and outperform existing work in both algorithm and hardware aspects. Our code is publicly available at: https://github.com/zehuanzhang/BayesCVNN.git.

Authors:SLAM Labs, :, Oleksiy Ostapenko, Raymond Li, Torsten Scholak, Alireza Mousavi-Hosseini, Aman Tiwari, Denis Kocetkov, Joel Lamy Poirier, Kelechi Ogueji, Nanda H Krishna, Rafael Pardinas, Sathwik Tejaswi Madhusudhan, Shruthan Radhakrishna, Srinivas Sunkara, Valerie Becaert
Title: Super Apriel: One Checkpoint, Many Speeds
Abstract:
We release Super Apriel, a 15B‑parameter supernet in which every decoder layer provides four trained mixer choices ‑‑ Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A placement selects one mixer per layer; placements can be switched between requests at serving time without reloading weights, enabling multiple speed presets from a single checkpoint. The shared checkpoint also enables speculative decoding without a separate draft model. The all‑FA preset matches the Apriel 1.6 teacher on all reported benchmarks; recommended hybrid presets span 2.9× to 10.7× decode throughput at 96% to 77% quality retention, with throughput advantages that compound at longer context lengths. With four mixer types across 48 layers, the configuration space is vast. A surrogate that predicts placement quality from the per‑layer mixer assignment makes the speed‑quality landscape tractable and identifies the best tradeoffs at each speed level. We investigate whether the best configurations at each speed level can be identified early in training or only after convergence. Rankings stabilize quickly at 0.5B scale, but the most efficient configurations exhibit higher instability at 15B, cautioning against extrapolation from smaller models. Super Apriel is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine‑tuning. We release the supernet weights, Fast‑LLM training code, vLLM serving code, and a placement optimization toolkit.

Authors:Jason Z Wang
Title: MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models
Abstract:
We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self‑knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000 evaluation instances using five independent behavioral measurement channels. Core experiments are run across the full model roster; experiments with specialized infrastructure requirements report explicitly marked model subsets. We find two phenomena with direct implications for agentic deployment: (1) compositional self‑prediction fails universally ‑‑ the Compositional Calibration Error ranges from 0.500 to 0.943 on the original 15‑model Exp3‑v1 set (and 0.434 to 0.758 on the balanced 16‑model Exp3‑v2 expansion), indicating that models cannot predict their own performance on multi‑domain tasks, and (2) models exhibit above‑chance but imperfect domain‑specific self‑knowledge yet systematically fail to translate even this partial awareness into appropriate agentic action‑selection ‑‑ external metacognitive control reduces the Confident Failure Rate from 0.600 to 0.143 (76% reduction at temperature 0; mean 70% at temperature 0.7 across 5 models from 4 labs). Providing models with their own calibration scores produces no significant improvement (p > 0.05); only architectural constraint is effective. This suggests that external metacognitive scaffolding ‑‑ not improved self‑knowledge ‑‑ is the path to safer autonomous AI systems. Code, data, and Croissant metadata will be released publicly with the benchmark.

Authors:Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal
Title: Generalization at the Edge of Stability
Abstract:
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the `sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.

Authors:Perry Dong, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn
Title: FASTER: Value-Guided Sampling for Fast RL
Abstract:
Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test‑time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling‑based test‑time scaling of diffusion‑based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long‑horizon manipulation tasks in online and batch‑online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .

Authors:Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu
Title: VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
Abstract:
We present VLA Foundry, an open‑source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open‑source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end‑to‑end control, from language pretraining to action‑expert fine‑tuning. VLA Foundry supports both from‑scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM‑‑>VLM‑‑>VLA pipeline and the second built on the pretrained Qwen3‑VL backbone. We evaluate closed‑loop policy performance of both models on LBM Eval, an open‑data, open‑source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully‑open from‑scratch model is on par with our prior closed‑source work and substituting in the Qwen3‑VL backbone leads to a strong multi‑task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI‑ML/vla_foundry and all multi‑task model weights are released on https://huggingface.co/collections/TRI‑ML/vla‑foundry. Additional qualitative videos are available on the project website https://tri‑ml.github.io/vla_foundry.

Authors:Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang
Title: Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
Abstract:
At present, executable visual workflows have emerged as a mainstream paradigm in real‑world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve‑making development costly, time‑consuming, and error‑prone. To study whether large language models can automate this multi‑round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real‑world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state‑of‑the‑art language models can often capture high‑level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real‑world gap positions Chat2Workflow as a foundation for advancing industrial‑grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

Authors:Quang-Huy Nguyen, Thanh-Hai Nguyen, Khac-Manh Thai, Duc-Hoang Pham, Huy-Son Nguyen, Cam-Van Thi Nguyen, Masoud Mansoury, Duc-Trong Le, Hoang-Quynh Le
Title: From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems
Abstract:
Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user‑item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re‑implement, and re‑evaluate eleven state‑of‑the‑art CE methods for recommender systems, covering both native explainers (e.g., LIME‑RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph‑based explainers originally proposed for GNNs. Here, a unified benchmarking framework is proposed to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item‑level vs. list‑level), and perturbation scope (user interaction vectors vs. user‑item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item‑level assessments to top‑K list‑level explanations. Through extensive experiments on three real‑world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade‑off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item level and list level evaluations, and several graph‑based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: https://github.com/L2R‑UET/CFExpRec.

Authors:Ghadah Alosaimi, Hanadi Alhamdan, Wenke E, Stamos Katsigiannis, Amir Atapour-Abarghouei, Toby P. Breckon
Title: Mind2Drive: Predicting Driver Intentions from EEG in Real-world On-Road Driving
Abstract:
Predicting driver intention from neurophysiological signals offers a promising pathway for enhancing proactive safety in advanced driver assistance systems, yet remains challenging in real‑world driving due to EEG signal non‑stationarity and the complexity of cognitive‑motor preparation. This study proposes and evaluates an EEG‑based driver intention prediction framework using a synchronised multi‑sensor platform integrated into a real electric vehicle. A real‑world on‑road dataset was collected across 32 driving sessions, and 12 deep learning architectures were evaluated under consistent experimental conditions. Among the evaluated architectures, TSCeption achieved the highest average accuracy (0.907) and Macro‑F1 score (0.901). The proposed framework demonstrates strong temporal stability, maintaining robust decoding performance up to 1000 ms before manoeuvre execution with minimal degradation. Furthermore, additional analyses reveal that minimal EEG preprocessing outperforms artefact‑handling pipelines, and prediction performance peaks within a 400‑600 ms interval, corresponding to a critical neural preparatory phase preceding driving manoeuvres. Overall, these findings support the feasibility of early and stable EEG‑based driver intention decoding under real‑world on‑road conditions. Code: https://github.com/galosaimi/Mind2Drive.

Authors:Yilun Liu, Ruihong Qiu, Zi Huang
Title: TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
Abstract:
Zero‑shot reasoning on text‑rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task‑specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)‑based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN‑R1‑Zero, a post‑training framework for TRN reasoning trained solely via reinforcement learning. TRN‑R1‑Zero directly optimises base LLMs using a Neighbour‑aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN‑R1‑Zero requires no supervised fine‑tuning or chain‑of‑thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co‑purchase TRN benchmarks demonstrate the superiority and robustness of TRN‑R1‑Zero. Moreover, relying strictly on node‑level training, TRN‑R1‑Zero achieves zero‑shot inference on edge‑ and graph‑level tasks, extending beyond cross‑domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN‑R1‑Zero.

Authors:Julian Skifstad, Xinyue Annie Yang, Glen Chou
Title: Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Abstract:
Inference‑time LLM alignment methods, particularly activation steering, offer an alternative to fine‑tuning by directly modifying activations during generation. Existing methods, however, often rely on non‑anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open‑loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer‑wise dynamics across multiple LLM architectures and scales are well‑approximated by locally‑linear models. Exploiting this property, we model LLM inference as a linear time‑varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer‑wise Jacobians, steering activations toward desired semantic setpoints in closed‑loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine‑grained behavior control across models, scales, and tasks, including state‑of‑the‑art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr‑activation‑steering

Authors:Ehsan Hoseinzade, Ke Wang, Anandharaju Durai Raju
Title: TabEmb: Joint Semantic-Structure Embedding for Table Annotation
Abstract:
Table annotation is crucial for making web and enterprise tables usable in downstream NLP applications. Unlike textual data where learning semantically rich token or sentence embeddings often suffice, tables are structured combinations of columns wherein useful representations must jointly capture column's semantics and the inter‑column relationships. Existing models learn by linearizing the 2D table into a 1D token sequence and encoding it with pretrained language models (PLMs) such as BERT. However, this leads to limited semantic quality and weaker generalization to unseen or rare values compared to modern LLMs, and degraded structural modeling due to 2D‑to‑1D flattening and context‑length constraints. We propose TabEmb, which directly targets these limitations by decoupling semantic encoding from structural modeling. An LLM first produces semantically rich embeddings for each column, and a graph‑based module over columns then injects relationships into the embeddings, yielding joint semantic‑tructural representations for table annotation. Experiments show that TabEmb consistently outperforms strong baselines on different table annotation tasks. Source code and datasets are available at https://github.com/hoseinzadeehsan/TabEmb

Authors:Chih-Yu Chang, Qiyuan Chen, Tianhan Gao, David Fenning, Chinedum Okwudire, Neil Dasgupta, Wei Lu, Raed Al Kontar
Title: Collaborative Contextual Bayesian Optimization
Abstract:
Discovering optimal designs through sequential data collection is essential in many real‑world applications. While Bayesian Optimization (BO) has achieved remarkable success in this setting, growing attention has recently turned to context‑specific optimal design, formalized as Contextual Bayesian Optimization (CBO). Unlike BO, CBO is inherently more challenging as it must approximate an entire mapping from the context space to its corresponding optimal design, requiring simultaneous exploration across contexts and exploitation within each. In many modern applications, such tasks arise across multiple potentially heterogeneous but related clients, where collaboration can significantly improve learning efficiency. We propose CCBO, Collaborative Contextual Bayesian Optimization, a unified framework enabling multiple clients to jointly perform CBO with controllable contexts, supporting both online collaboration and offline initialization from peers' historical beliefs, with an optional privacy‑preserving communication mechanism. We establish sublinear regret guarantees and demonstrate, through extensive simulations and a real‑world hot rolling application, that CCBO achieves substantial improvements over existing approaches even under client heterogeneity. The code to reproduce the results can be found at https://github.com/cchihyu/Collaborative‑Contextual‑Bayesian‑Optimization

Authors:Isaac Llorente-Saguer
Title: Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Abstract:
Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama‑3.2, Gemma‑3) and three alignment variants (base, instruction‑tuned, abliterated), under single‑turn, English evaluation, we characterise this geometry through six direction‑finding strategies. Three succeed: a soft‑AUC‑optimised linear direction reaches mean AUROC 0.98 and TPR@1%FPR 0.80; a class‑mean probe reaches 0.98 and 0.71 at <1ms fitting cost; a supervised angular‑deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction (73^\circ from projection‑based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held‑out HarmBench and JailbreakBench with worst‑case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains \geq0.98 and cross‑variant transfer stays within 0.018 of own‑direction performance This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@1%FPR should accompany AUROC in safety‑adjacent evaluation.

Authors:Manuel Israel Cazares
Title: Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
Abstract:
We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas ‑‑ a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt‑oss‑120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single‑prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60‑‑79% for gpt‑oss‑120b, compared to a 59.75% no‑cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non‑monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage‑point improvement over the no‑cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at https://github.com/israelcazares/sair‑prompt‑engineering

Authors:Benjamin K. Johnson, Thomas Goralski, Ayush Semwal, Hui Shen, H. Josh Jang
Title: Streaming Structured Inference with Flash-SemiCRF
Abstract:
Semi‑Markov Conditional Random Fields (semi‑CRFs) assign labels to segments of a sequence rather than to individual positions, enabling exact inference over segment‑level features and principled uncertainty estimates at their boundaries. However, existing implementations must materialize a large edge potential tensor whose size grows with sequence length, maximum segment length, and label count, becoming prohibitive for speech‑scale state spaces and intractable at genomic scales where sequences can exceed 100,000 positions. This memory bottleneck has limited the adoption of exact segment‑level inference for long sequences and large label sets. We identify that the core inefficiency is materializing edge potentials that can instead be evaluated on‑the‑fly from a compact prefix‑sum array, and make several improvements. First, replacing the stored edge tensor with prefix‑sum lookup reduces the memory footprint by a factor proportional to the product of segment length and label count. Second, a streaming forward‑backward pass with checkpoint‑boundary normalization keeps working memory sublinear in sequence length while preserving exact gradients. Third, zero‑centered cumulative scores control numerical drift and induce an adaptive duration prior under label imbalance. We integrate these ideas into Flash‑SemiCRF, a fused Triton kernel that enables exact semi‑CRF inference on previously intractable problem sizes. Available at https://github.com/biobenkj/flash‑semicrf.

Authors:Ruixuan Liu, David Evans, Li Xiong
Title: Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs
Abstract:
Indistinguishability properties such as differential privacy bounds or low empirically measured membership inference are widely treated as proxies to show a model is sufficiently protected against broader memorization risks. However, we show that indistinguishability properties are neither sufficient nor necessary for preventing data extraction in LLM APIs. We formalize a privacy‑game separation between extraction and indistinguishability‑based privacy, showing that indistinguishability and inextractability are incomparable: upper‑bounding distinguishability does not upper‑bound extractability. To address this gap, we introduce (l, b)‑inextractability as a definition that requires at least 2^b expected queries for any black‑box adversary to induce the LLM API to emit a protected l‑gram substring. We instantiate this via a worst‑case extraction game and derive a rank‑based extraction risk upper bound for targeted exact extraction, as well as extensions to cover untargeted and approximate extraction. The resulting estimator captures the extraction risk over multiple attack trials and prefix adaptations. We show that it can provide a tight and efficient estimation for standard greedy extraction and an upper bound on the probabilistic extraction risk given any decoding configuration. We empirically evaluate extractability across different models, clarifying its connection to distinguishability, demonstrating its advantage over existing extraction risk estimators, and providing actionable mitigation guidelines across model training, API access, and decoding configurations in LLM API deployment. Our code is publicly available at: https://github.com/Emory‑AIMS/Inextractability.

Authors:Liubomyr Horbatko
Title: Sessa: Selective State Space Attention
Abstract:
Modern sequence modeling is dominated by two families: Transformers, whose self‑attention can access arbitrary elements of the visible sequence, and structured state‑space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long‑range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention‑based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power‑law memory tails O(\ell^‑β) for 0 < β< 1, with slower decay than in the corresponding Transformer and Mamba‑style baselines. We further give an explicit construction that achieves this power‑law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long‑context benchmarks while remaining competitive with Transformer and Mamba‑style baselines on short‑context language modeling.

Authors:Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp Fürnstahl, Bernhard Schölkopf, Andreas Krause
Title: Bounded Ratio Reinforcement Learning
Abstract:
Proximal Policy Optimization (PPO) has become the predominant algorithm for on‑policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage‑weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross‑Entropy Method (CEM). We additionally extend BPO to Group‑relative BPO (GBPO) for LLM fine‑tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine‑tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.

Authors:A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros
Title: Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
Abstract:
The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets (\approx1K samples) and degrades substantially as the dataset is scaled to millions of samples. The same behavior is observed beyond text‑image, for text‑audio and text‑video alignment. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine‑grained structure. Moreover, the evaluations in Huh et al. are done in a one‑to‑one image‑caption setting, a constraint that breaks down in realistic many‑to‑many settings and further reduces measured alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross‑modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.

Authors:Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, Maximilian Kleinegger, Dan Alistarh
Title: GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
Abstract:
Quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2‑3 bits per parameter. The state of the art is currently split into simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3‑4 bits per parameter (bpp), and "second‑generation" vector‑ or trellis‑quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier but are notoriously hard to implement and to scale. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel‑Softmax Quantization), a post‑training scalar quantization method which jointly learns the per‑coordinate grid assignments and the per‑group scales using a Gumbel‑Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit‑width regime (e.g., 3‑8 levels for ternary and 3 bpp, respectively), making optimization tractable. Practically, on the standard Llama‑3.1‑8B/70B‑Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group‑wise quantization, and thus remains compatible with existing scalar inference kernels. We further show that the same discrete‑assignment optimization can be applied to practical GGUF K‑Quant checkpoints: starting from publicly released GGUF models, GSQ improves accuracy while projecting the result back into the same deployment format. Finally, GSQ scales to trillion‑scale Mixture‑of‑Experts models such as Kimi‑K2.5, where vector‑quantized methods are difficult to apply. The source code is publicly available at https://github.com/IST‑DASLab/GSQ.

Authors:Aniruddha Adiga, Jingyuan Chou, Anshul Chiranth, Bryan Lewis, Ana I. Bento, Shaun Truelove, Geoffrey Fox, Madhav Marathe, Harry Hochheiser, Srini Venkatramanan
Title: IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
Abstract:
Epidemic forecasting has become an integral part of real‑time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real‑time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding on performance of these methods for novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles from multiple data repositories spanning over a century of surveillance and across U.S. states and global locations. We perform derivative‑based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information‑theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi‑horizon short‑term forecasting (1‑ to 4‑week‑ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as Normalized Weighted Interval Score (NWIS) to quantify the performance. We find that MLP‑based methods have the most robust performance, with statistical methods having a slight edge during the pre‑peak phase. IDOBE dataset along with baselines are released publicly on https://github.com/NSSAC/IDOBE to enable standardized, reproducible benchmarking of outbreak forecasting methods.

Authors:Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang
Title: UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
Abstract:
Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM‑GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced‑Step and CFG‑Free, to further improve training efficiency. UDM‑GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state‑of‑the‑art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM‑GRPO.

Authors:Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin
Title: NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
Abstract:
Discrete diffusion language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive approaches, offering the flexibility to generate tokens in arbitrary orders and the potential of parallel decoding. However, existing heuristic sampling strategies remain inefficient: they choose only a small part of tokens to sample at each step, leaving substantial room for improvement. In this work, we study the problem of token sampling order optimization and demonstrate its significant potential for acceleration. Specifically, we find that fully leveraging correct predictions at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy. Based on this, we propose Neural Indicator Sampling (NI Sampling), a general sampling order optimization framework that utilize a neural indicator to decide which tokens should be sampled at each step. We further propose a novel trajectory‑preserving objective to train the indicator. Experiments on LLaDA and Dream models across multiple benchmarks show that our method achieves up to 14.3× acceleration over full‑step sampling with negligible performance drop, and consistently outperforms confidence threshold sampling in the accuracy‑step trade‑off. Code is available at https://github.com/imagination‑research/NI‑Sampling.

Authors:Wei Chen, Yubing Wu, Junmei Yang, Delu Zeng, Qibin Zhao, John Paisley, Min Chen, Zhou Wang
Title: Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner
Abstract:
Preference optimization is widely used to align large language models (LLMs) with human preferences. However, many margin‑based methods also suppress the chosen response when they try to suppress the rejected one, and there is no general way to prevent this across different objectives. We address this issue with a unified incentive‑score decomposition of preference optimization, revealing that different objectives share the same local update directions and differ only in their scalar weights. This decomposition provides a common framework for analyzing objectives that were previously studied in separate settings. Building on this decomposition, by analyzing the dynamics of the chosen/rejected likelihoods, we identify the disentanglement band (DB), a simple, testable condition that tells us when training can follow the desired path: suppress the loser while preserving the winner, possibly after an early stage. Using the DB, we propose reward calibration (RC), a plug‑and‑play method that adaptively rebalances the updates for chosen and rejected responses to satisfy the DB, without redesigning the base objective. Empirical results show that RC leads to more disentangled dynamics, with better downstream performance observed across several settings. Our code is available at https://github.com/IceyWuu/DisentangledPreferenceOptimization.

Authors:Josh Millar, Ashok Samraj Thangarajan, Soumyajit Chatterjee, Hamed Haddadi
Title: Towards Real-Time ECG and EMG Modeling on $μ$NPUs
Abstract:
The miniaturisation of neural processing units (NPUs) and other low‑power accelerators has enabled their integration into microcontroller‑scale wearable hardware, supporting near‑real‑time, offline, and privacy‑preserving inference. Yet physiological signal analysis has remained infeasible on such hardware; recent Transformer‑based models show state‑of‑the‑art performance but are prohibitively large for resource‑ and power‑constrained hardware and incompatible with μNPUs due to their dynamic attention operations. We introduce PhysioLite, a lightweight, NPU‑compatible model architecture and training framework for ECG/EMG signal analysis. Using learnable wavelet filter banks, CPU‑offloaded positional encoding, and hardware‑aware layer design, PhysioLite reaches performance comparable to state‑of‑the‑art Transformer‑based foundation models on ECG and EMG benchmarks, while being <10% of the size (~370KB with 8‑bit quantization). We also profile its component‑wise latency and resource consumption on both the MAX78000 and HX6538 WE2 μNPUs, demonstrating its viability for signal analysis on constrained, battery‑powered hardware. We release our model(s) and training framework at: https://github.com/j0shmillar/physiolite.

Authors:Xiaoyuan Cheng, Haoyu Wang, Wenxuan Yuan, Ziyan Wang, Zonghao Chen, Li Zeng, Zhuo Sun
Title: Fisher Decorator: Refining Flow Policy via a Local Transport Map
Abstract:
Recent advances in flow‑based offline reinforcement learning (RL) have achieved strong performance by parameterizing policies via flow matching. However, they still face critical trade‑offs among expressiveness, optimality, and efficiency. In particular, existing flow policies interpret the L_2 regularization as an upper bound of the 2‑Wasserstein distance (W_2), which can be problematic in offline settings. This issue stems from a fundamental geometric mismatch: the behavioral policy manifold is inherently anisotropic, whereas the L_2 (or upper bound of W_2) regularization is isotropic and density‑insensitive, leading to systematically misaligned optimization directions. To address this, we revisit offline RL from a geometric perspective and show that policy refinement can be formulated as a local transport map: an initial flow policy augmented by a residual displacement. By analyzing the induced density transformation, we derive a local quadratic approximation of the KL‑constrained objective governed by the Fisher information matrix, enabling a tractable anisotropic optimization formulation. By leveraging the score function embedded in the flow velocity, we obtain a corresponding quadratic constraint for efficient optimization. Our results reveal that the optimality gap in prior methods arises from their isotropic approximation. In contrast, our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution. Extensive experiments demonstrate state‑of‑the‑art performance across diverse offline RL benchmarks. See project page: https://github.com/ARC0127/Fisher‑Decorator.

Authors:Franki Nguimatsia Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier
Title: SVL: Goal-Conditioned Reinforcement Learning as Survival Learning
Abstract:
Standard approaches to goal‑conditioned reinforcement learning (GCRL) that rely on temporal‑difference learning can be unstable and sample‑inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time‑to‑goal from each state as a probability distribution. This structured distributional Monte Carlo perspective yields a closed‑form identity that expresses the goal‑conditioned value function as a discounted sum of survival probabilities, enabling value estimation via a hazard model trained via maximum likelihood on both event and right‑censored trajectories. We introduce three practical value estimators, including finite‑horizon truncation and two binned infinite‑horizon approximations to capture long‑horizon objectives. Experiments on offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD and Monte Carlo baselines, excelling on complex, long‑horizon tasks. Webpage and Code: https://simple‑robotics.github.io/publications/survival‑value‑learning/

Authors:Mattie Ji, Indradyumna Roy, Vikas Garg
Title: Contraction and Hourglass Persistence for Learning on Graphs, Simplices, and Cells
Abstract:
Persistent homology (PH) encodes global information, such as cycles, and is thus increasingly integrated into graph neural networks (GNNs). PH methods in GNNs typically traverse an increasing sequence of subgraphs. In this work, we first expose limitations of this inclusion procedure. To remedy these shortcomings, we analyze contractions as a principled topological operation, in particular, for graph representation learning. We study the persistence of contraction sequences, which we call Contraction Homology (CH). We establish that forward PH and CH differ in expressivity. We then introduce Hourglass Persistence, a class of topological descriptors that interleave a sequence of inclusions and contractions to boost expressivity, learnability, and stability. We also study related families parametrized by two paradigms. We also discuss how our framework extends to simplicial and cellular networks. We further design efficient algorithms that are pluggable into end‑to‑end differentiable GNN pipelines, enabling consistent empirical improvements over many PH methods across standard real‑world graph datasets. Code is available at \hrefhttps://github.com/Aalto‑QuML/Hourglassthis https URL.

Authors:Harshavardhanan Deekeswar
Title: ONTO: A Token-Efficient Columnar Notation for LLM Input Optimization
Abstract:
Serialization formats designed for document interchange impose structural overhead that becomes prohibitive when large language models consume operational data at scale. A modest dataset of 1,000 IoT sensor readings serialized as JSON requires approximately 80,000 tokens ‑ the majority spent on repeated field names, nested braces, and structural punctuation rather than semantic content. We present ONTO (Object Notation for Token Optimization), a columnar notation that declares field names once per entity and arranges values in pipe‑delimited rows with indentation‑based hierarchy. This schema‑once, data‑many design eliminates per‑record key repetition while preserving human readability and nested structure support. Evaluation across three synthetic operational datasets demonstrates 46‑51% token reduction versus JSON, with stable scaling from 100 to 1,000 records. Controlled inference benchmarks on Qwen2.5‑7B show corresponding 5‑10% latency improvement. Comprehension validation confirms no material degradation in LLM task accuracy across lookup, counting, extraction, and aggregation operations when format context is provided. Ablation analysis reveals that key repetition accounts for the majority of JSON overhead, with indentation costs in nested structures explaining the 4‑percentage‑point gap between flat and hierarchical data. ONTO occupies a previously unfilled position in the serialization landscape: columnar efficiency with hierarchical structure, optimized for LLM context windows rather than document interchange. Code and specification are available at https://github.com/harsh‑aranga/onto.

Authors:Qihao Shen, Jiaxing Xuan, Zhenguang Liu, Sifan Wu, Yutong Xie, Zhaoyan Ming, Yingying Jiao, kui Ren
Title: Unveiling Deepfakes: A Frequency-Aware Triple Branch Network for Deepfake Detection
Abstract:
Advanced deepfake technologies are blurring the lines between real and fake, presenting both revolutionary opportunities and alarming threats. While it unlocks novel applications in fields like entertainment and education, its malicious use has sparked urgent ethical and societal concerns ranging from identity theft to the dissemination of misinformation. To tackle these challenges, feature analysis using frequency features has emergedas a promising direction for deepfake detection. However, oneaspect that has been overlooked so far is that existing methodstend to concentrate on one or a few specific frequency domains,which risks overfitting to particular artifacts and significantlyundermines their robustness when facing diverse forgery patterns. Another underexplored aspect we observe is that different features often attend to the same forged region, resulting in redundant feature representations and limiting the diversity of the extracted clues. This may undermine the ability of a model to capture complementary information across different facets, thereby compromising its generalization capability to diverse manipulations. In this paper, we seek to tackle these challenges from two aspects: (1) we propose a triple‑branch network that jointly captures spatial and frequency features by learning from both original image and image reconstructed by different frequency channels, and (2) we mathematically derive feature decoupling and fusion losses grounded in the mutual information theory, which enhances the model to focus on task‑relevant features across the original image and the image reconstructed by different frequency channels. Extensive experiments on six large‑scale benchmark datasets demonstrate that our method consistently achieves state‑of‑the‑art performance. Our code is released at https://github.com/injooker/Unveiling Deepfake.

Authors:Keyang Chen, Mingxuan Jiang, Yongsheng Zhao, Zeping Li, Zaiyuan Chen, Weiqi Luo, Zhixin Li, Sen Liu, Yinan Jing, Guangnan Ye, Xihong Wu, Hongfeng Chai
Title: TransXion: A High-Fidelity Graph Benchmark for Realistic Anti-Money Laundering
Abstract:
Money laundering poses severe risks to global financial systems, driving the widespread adoption of machine learning for transaction monitoring. However, progress remains stifled by the lack of realistic benchmarks. Existing transaction‑graph datasets suffer from two pervasive limitations: (i) they provide sparse node‑level semantics beyond anonymized identifiers, and (ii) they rely on template‑driven anomaly injection, which biases benchmarks toward static structural motifs and yields overly optimistic assessments of model robustness. We propose TransXion, a benchmark ecosystem for Anti‑Money Laundering (AML) research that integrates profile‑aware simulation of normal activity with stochastic, non‑template synthesis of illicit subgraphs.TransXion jointly models persistent entity profiles and conditional transaction behavior, enabling evaluation of "out‑of‑character" anomalies where observed activity contradicts an entity's socio‑economic context. The resulting dataset comprises approximately 3 million transactions among 50,000 entities, each endowed with rich demographic and behavioral attributes. Empirical analyses show that TransXion reproduces key structural properties of payment networks, including heavy‑tailed activity distributions and localized subgraph structure. Across a diverse array of detection models spanning multiple algorithmic paradigms, TransXion yields substantially lower detection performance than widely used benchmarks, demonstrating increased difficulty and realism. TransXion provides a more faithful testbed for developing context‑aware and robust AML detection methods. The dataset and code are publicly available at https://github.com/chaos‑max/TransXion.

Authors:Kadir-Kaan Özer, René Ebeling, Markus Enzweiler
Title: Back to Repair: A Minimal Denoising Network\ for Time Series Anomaly Detection
Abstract:
We introduce JuRe (Just Repair), a minimal denoising network for time series anomaly detection that exposes a central finding: architectural complexity is unnecessary when the training objective correctly implements the manifold‑projection principle. JuRe consists of a single depthwise‑separable convolutional residual block with hidden dimension 128, trained to repair corrupted time series windows and scored at inference by a fixed, parameter‑free structural discrepancy function. Despite using no attention, no latent variable, and no adversarial component, JuRe ranks second on the TSB‑AD multivariate benchmark (AUC‑PR 0.404, 180 series, 17 datasets) and second on the UCR univariate archive by AUC‑PR (0.198, 250 series), leading all neural baselines on AUC‑PR and VUS‑PR. Component ablation on TSB‑AD identifies training‑time corruption as the dominant factor (ΔAUC‑PR = 0.047 on removal), confirming that the denoising objective, not network capacity, drives detection quality. Pairwise Wilcoxon signed‑rank tests establish statistical significance against 21 of 25 baselines on TSB‑AD. Code is available at the URL https://github.com/iis‑esslingen/JuRe.

Authors:Yifan Zhang, Jieyu Li, Kexin Pei, Yu Huang, Kevin Leach
Title: SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
Abstract:
Large Language Models (LLMs) show promise for automated code repair but often struggle with the complex semantic and structural correctness required. We present SynthFix, a hybrid neural‑symbolic framework that improves LLM‑based vulnerability repair by unifying code synthesis with compiler‑informed symbolic feedback. The core of our approach is an adaptive training strategy where a neural Router Model directs code samples to either Supervised Fine‑Tuning (SFT) to learn common patterns or Reward Fine‑Tuning (RFT) with symbolic rewards for complex, iterative refinement. On the FixJS (JavaScript) and CodeFlaws (C) benchmarks, SynthFix achieves up to 18% relative improvement in CodeBLEU/CrystalBLEU and 32% in Exact Match over strong SFT and RFT baselines. Our results show that this adaptive combination of training strategies, which mirrors how developers alternate between pattern application and tool feedback, significantly improves the accuracy and efficiency of LLM‑based vulnerability repair. Our code and data are available at https://github.com/CoderDoge1108/SynthFix.

Authors:Tianbao Zhang
Title: Harness as an Asset: Enforcing Determinism via the Convergent AI Agent Framework (CAAF)
Abstract:
Large Language Models (LLMs) produce a controllability gap in safety‑critical engineering: even low rates of undetected constraint violations render a system undeployable. Current orchestration paradigms suffer from sycophantic compliance, context attention decay [Liu et al., 2024], and stochastic oscillation during self‑correction [Huang et al., 2024]. We introduce the Convergent AI Agent Framework (CAAF), which transitions agentic workflows from open‑loop generation to closed‑loop Fail‑Safe Determinism via three pillars: (1) Recursive Atomic Decomposition with physical context firewalls; (2) Harness as an Asset, formalizing domain invariants into machine‑readable registries enforced by a deterministic Unified Assertion Interface (UAI); and (3) Structured Semantic Gradients with State Locking for monotonic convergence. Empirical evaluation across two domains ‑‑ SAE Level 3 (L3) autonomous driving (AD) (n=30, 7 conditions) and pharmaceutical continuous flow reactor design (n=20, 4 conditions including a Mono+UAI ablation) ‑‑ shows that CAAF‑all‑GPT‑4o‑mini achieves 100% paradox detection while monolithic GPT‑4o achieves 0% (even at temperature=0). The pharmaceutical benchmark features 7 simultaneous constraints with nonlinear Arrhenius interactions and a 3‑way minimal unsatisfiable subset, representing a structurally harder challenge than the 2‑constraint AD paradox. Alternative multi‑agent architectures (debate, sequential checking) also achieve 0% across 80 trials, confirming that CAAF's reliability derives from its deterministic UAI, not from multi‑agent orchestration per se. A Mono+UAI ablation (95%) isolates UAI as the core contribution. CAAF's reliability is invariant to prompt hints; all components use a single commodity model, enabling fully offline deployment.

Authors:Sai Vegasena
Title: Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
Abstract:
We present Open‑TQ‑Metal, the first implementation of fused compressed‑domain attention on Apple Silicon, enabling 128K‑context inference for Llama 3.1 70B on a single 64GB consumer Mac ‑‑ a configuration impossible with all existing inference frameworks. Open‑TQ‑Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves 48x attention speedup at 128K context over the dequantize‑then‑attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains identical top‑1 token predictions to FP16 inference. We further provide the first cross‑architecture analysis of KV cache quantization methods, revealing that the attention scale factor ‑‑ not model size ‑‑ determines whether angular quantization schemes like PolarQuant succeed or fail, with Gemma 4's attn_scale=1.0 amplifying directional error 25‑100x more than Llama's standard 1/sqrt(d) scaling.

Authors:Yingzhi Xia, Setthakorn Tanomkiattikun, Liangli Zhen, Zaiwang Gu
Title: Noise-Adaptive Diffusion Sampling for Inverse Problems Without Task-Specific Tuning
Abstract:
Diffusion models (DMs) have recently shown remarkable performance on inverse problems (IPs). Optimization‑based methods can fast solve IPs using DMs as powerful regularizers, but they are susceptible to local minima and noise overfitting. Although DMs can provide strong priors for Bayesian approaches, enforcing measurement consistency during the denoising process leads to manifold infeasibility issues. We propose Noise‑space Hamiltonian Monte Carlo (N‑HMC), a posterior sampling method that treats reverse diffusion as a deterministic mapping from initial noise to clean images. N‑HMC enables comprehensive exploration of the solution space, avoiding local optima. By moving inference entirely into the initial‑noise space, N‑HMC keeps proposals on the learned data manifold. We provide a comprehensive theoretical analysis of our approach and extend the framework to a noise‑adaptive variant (NA‑NHMC) that effectively handles IPs with unknown noise type and level. Extensive experiments across four linear and three nonlinear inverse problems demonstrate that NA‑NHMC achieves superior reconstruction quality with robust performance across different hyperparameters and initializations, significantly outperforming recent state‑of‑the‑art methods. The code is available at https://github.com/NA‑HMC/NA‑HMC.

Authors:Weiyu Ma, Yongcheng Zeng, Yan Song, Xinyu Cui, Jian Zhao, Xuhui Liu, Mohamed Elhoseiny
Title: Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
Abstract:
Reinforcement Learning (RL) has achieved impressive success in post‑training Large Language Models (LLMs) and Vision‑Language Models (VLMs), with on‑policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi‑turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion‑parameter models renders stored priorities stale, causing old high‑priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness‑Aware PER, which addresses this priority staleness problem by augmenting any PER‑based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness‑Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi‑step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness‑Aware PER significantly outperforms on‑policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision‑CAIR/Freshness‑Aware‑PER.

Authors:Zahid Hasan, Masud Ahmed, Nirmalya Roy
Title: Lorentz Framework for Semantic Segmentation
Abstract:
Semantic segmentation in hyperbolic space enables compact modeling of hierarchical structure while providing inherent uncertainty quantification. Prior approaches predominantly rely on the Poincaré ball model, which suffers from numerical instability, optimization, and computational challenges. We propose a novel, tractable, architecture‑agnostic semantic segmentation framework (pixel‑wise and mask classification) in the hyperbolic Lorentz model. We employ text embeddings with semantic and visual cues to guide hierarchical pixel‑level representations in Lorentz space. This enables stable and efficient optimization without requiring a Riemannian optimizer, and easily integrates with existing Euclidean architectures. Beyond segmentation, our approach yields free uncertainty estimation, confidence map, boundary delineation, hierarchical and text‑based retrieval, and zero‑shot performance, reaching generalized flatter minima. We introduce a novel uncertainty and confidence indicator in Lorentz cone embeddings. Further, we provide analytical and empirical insights into Lorentz optimization via gradient analysis. Extensive experiments on ADE20K, COCO‑Stuff‑164k, Pascal‑VOC, and Cityscapes, utilizing state‑of‑the‑art per‑pixel classification models (DeepLabV3 and SegFormer) and mask classification models (mask2former and maskformer), validate the effectiveness and generality of our approach. Our results demonstrate the potential of hyperbolic Lorentz embeddings for robust and uncertainty‑aware semantic segmentation. Code is available at https://github.com/mxahan/Lorentz_semantic_segmentation.

Authors:Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong, Chien-Sheng Wu
Title: The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation
Abstract:
On‑policy distillation (OPD) is an increasingly important paradigm for post‑training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment‑time information. We formalize this perspective theoretically, showing that teacher‑conditioned success is generally not a valid target for deployment‑time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration‑aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self‑reported confidence with this student‑grounded target, and distills the revised response through the same self‑distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto‑optimal calibration while maintaining competitive capability, generalizing robustly under out‑of‑distribution and continual learning. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post‑training. Code: https://github.com/SalesforceAIResearch/CaOPD

Authors:Yunshan Peng, Ji Wu, Wentao Bai, Yunke Bai, Jinan Pang, Wenzheng Shu, Yanxiang Zeng, Xialong Liu, Peng Jiang
Title: R&F-Inventory: A Large-Scale Dataset for Monotonic Inventory Estimation in Reach and Frequency Advertising
Abstract:
Reach and Frequency (R&F) contract advertising is an important form of widely used brand advertising. Unlike performance advertising, R&F contracts emphasize controllable delivery of UV and PV under given targeting, scheduling, and frequency control constraints. In practical systems, advertisers typically need to view the UV, PV change curves at different budget levels in real time when creating an R&F contract. However, most existing publicly available advertising datasets are based on independent samples, lacking a characterization of the core structure of the "budget‑performance curve" (including UV and PV) in R&F contracts.This paper proposes and releases a large‑scale R&F contract inventory estimation dataset. This dataset uses the R&F contract context consisting of "targeting‑scheduling‑frequency control" as the basic context, providing observations of UV and PV corresponding to multiple budget points within the same context, thus forming a complete budget‑performance curve. The dataset explicitly includes a time‑window‑based frequency control mechanism (e.g.,"no more than 3 times within 5 days") and naturally satisfies the monotonicity and diminishing marginal returns characteristics in the budget and scheduling dimensions. We further derive the theoretical maximum exposure ceiling and use it as a consistency check to evaluate data quality and the feasibility of model predictions. Using this data set, this paper defines two standardized benchmark tasks: single‑point performance prediction and reconstruction of budget‑performance curves, and provides a set of reproducible baseline methods and evaluation protocols. This dataset can support systematic research on problems such as structural constraint learning, monotonic regression, curve consistency modeling, and R&F contract planning.The code for our experiments can be found at https://github.com/pengyunshan/RF‑Inventory.

Authors:Chongsheng Zhang, Hao Wang, Zelong Yu, Esteban Garces Arias, Julian Rodemann, Zhanshuo Zhang, Qilong Li, Gaojuan Fan, Krikamol Muandet, Christian Heumann
Title: Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration
Abstract:
Imbalanced data is commonly present in real‑world applications. While data synthesis can effectively mitigate the data scarcity problem of rare‑classes, and LLMs have revolutionized text generation, the application of LLMs to relational/structured tabular data synthesis remains underexplored. Moreover, existing approaches lack an effective feedback mechanism that can guide LLMs towards continuously optimizing the quality of the generated data throughout the synthesis process. In this work, we propose RDDG, Relational Data generator with Dynamic Guidance, which is a unified in‑context learning framework that employs progressive chain‑of‑thought (CoT) steps to generate tabular data for enhancing downstream imbalanced classification performance. RDDG first uses core set selection to identify representative samples from the original data, then utilizes in‑context learning to discover the inherent patterns and correlations among attributes within the core set, and subsequently generates tabular data while preserving the aforementioned constraints. More importantly, it incorporates a self‑reinforcing feedback mechanism that provides automatic assessments on the quality of the generated data, enabling continuous quality optimization throughout the generation process. Experimental results on multiple real and synthetic datasets demonstrate that RDDG outperforms existing approaches in both data fidelity and downstream imbalanced classification performance. We make our code available at https://github.com/cszhangLMU/RDDG.

Authors:Lena Zellinger, Nicola Branchini, Lennert De Smet, Víctor Elvira, Nikolay Malkin, Antonio Vergari
Title: How to Approximate Inference with Subtractive Mixture Models
Abstract:
Classical mixture models (MMs) are widely used tractable proposals for approximate inference settings such as variational inference (VI) and importance sampling (IS). Recently, mixture models with negative coefficients, called subtractive mixture models (SMMs), have been proposed as a potentially more expressive alternative. However, how to effectively use SMMs for VI and IS is still an open question as they do not provide latent variable semantics and therefore cannot use sampling schemes for classical MMs. In this work, we study how to circumvent this issue by designing several expectation estimators for IS and learning schemes for VI with SMMs, and we empirically evaluate them for distribution approximation. Finally, we discuss the additional challenges in estimation stability and learning efficiency that they carry and propose ways to overcome them. Code is available at: https://github.com/april‑tools/delta‑vi.

Authors:Montgomery Bohde, Hongxuan Liu, Mrunali Manjrekar, Magdalena Lederbauer, Shuiwang Ji, Runzhong Wang, Connor W. Coley
Title: FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time
Abstract:
In this work, we present FRIGID, a framework with a novel diffusion language model that generates molecular structures conditioned on mass spectra via intermediate fingerprint representations and determined chemical formulae, training at the scale of hundreds of millions of unlabeled structures. We then demonstrate how forward fragmentation models enable inference‑time scaling by identifying spectrum‑inconsistent fragments and refining them through targeted remasking and denoising. While FRIGID already achieves strong performance with its diffusion base, inference‑time scaling significantly improves its accuracy, surpassing 18% Top‑1 accuracy on the challenging MassSpecGym benchmark and tripling the Top‑1 accuracy of the leading methods on NPLIB1. Further empirical analyses show that FRIGID exhibits log‑linear performance scaling with increasing inference‑time compute, opening a promising new direction for continued improvements in de novo structural elucidation. FRIGID code is publicly available at https://github.com/coleygroup/FRIGID

Authors:Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen, Yiming Yang, Sean Welleck
Title: AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
Abstract:
Recent large language model (LLM) agents have shown promise in using execution feedback for test‑time adaptation. However, robust self‑improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain‑specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non‑linear optimization landscape further make naive generation and local refinement unreliable. We propose AdaExplore, an agent framework that enables self‑improvement via accumulated execution feedback for performance‑critical kernel code generation through two complementary stages: failure‑driven adaptation and diversity‑preserving search, jointly improving correctness and optimization performance without additional fine‑tuning or external knowledge. In the adaptation stage, the agent synthesizes tasks and converts recurring failures into a reusable memory of validity rules, helping subsequent generations remain within the feasible set. In the search stage, the agent organizes candidate kernels as a tree and alternates between small local refinements and larger structural regeneration, allowing it to explore the optimization landscape beyond local optima. Experiments on kernel runtime optimization benchmarks validate these gains: AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level‑2 and Level‑3, respectively, within 100 steps, and continues to improve with additional computation.

Authors:Zongru Li, Xingsheng Chen, Honggang Wen, Regina Qianru Zhang, Ming Li, Xiaojin Zhang, Hongzhi Yin, Qiang Yang, Kwok-Yan Lam, Pietro Lio, Siu-Ming Yiu
Title: A Systematic Survey and Benchmark of Deep Learning for Molecular Property Prediction in the Foundation Model Era
Abstract:
Molecular property prediction integrates quantum chemistry, cheminformatics, and deep learning to connect molecular structure with physicochemical and biological behavior. This survey traces four complementary paradigms, including Quantum, Descriptor Machine Learning, Geometric Deep Learning, and Foundation Models, and outlines a unified taxonomy linking molecular representations, model architectures, and interdisciplinary applications. Benchmark analyses integrate evidence from both widely used datasets and datasets reflecting industry perspectives, encompassing quantum, physicochemical, physiological, and biophysical domains. The survey examines current standards in data curation, splitting strategies, and evaluation protocols, highlighting challenges including inconsistent stereochemistry, heterogeneous assay sources, and reproducibility limitations under random or poorly defined splits. These observations motivate the modernization of benchmark design toward more transparent, time‑ and scaffold‑aware methodologies. We further propose three forward‑looking directions: (i) physics‑aware learning embedding quantum consistency, (ii) uncertainty‑calibrated foundation models for trustworthy inference, and (iii) realistic multimodal benchmark ecosystems integrating computational and experimental data. Repository: https://github.com/Zongru‑Li/Survey‑and‑Benchmarks‑of‑DL‑for‑Molecular‑Property‑Prediction‑in‑the‑Foundation‑Model‑Era.

Authors:Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu, Weihao Yuan, Siyu Zhu
Title: BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
Abstract:
Autoregressive vision‑language models (VLMs) deliver strong multimodal capability, but their token‑by‑token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large‑block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same‑architecture, decoding‑efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage‑wise intra‑dVLM distillation from a fixed small‑block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory‑friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive‑to‑diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with \leq 4.4M data, BARD‑VL transfers strong multimodal capability from Qwen3‑VL to a large‑block dVLM. Remarkably, BARD‑VL establishes a new SOTA among comparable‑scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD‑VL achieves up to 3× decoding throughput speedup compared to the source model. Code is available at: \hrefhttps://github.com/fudan‑generative‑vision/Bard‑VLthis~https~URL.

Authors:Devendra Ghori
Title: Aletheia: Physics-Conditioned Localized Artifact Attention (PhyLAA-X) for End-to-End Generalizable and Robust Deepfake Video Detection
Abstract:
State‑of‑the‑art deepfake detectors achieve near‑perfect in‑domain accuracy yet degrade under cross‑generator shifts, heavy compression, and adversarial perturbations. The core limitation remains the decoupling of semantic artifact learning from physical invariants: optical‑flow discontinuities, specular‑reflection inconsistencies, and cardiac‑modulated reflectance (rPPG) are treated either as post‑hoc features or ignored. We introduce PhyLAA‑X, a novel physics‑conditioned extension of Localized Artifact Attention (LAA‑X). PhyLAA‑X injects three end‑to‑end differentiable physics‑derived feature volumes ‑ optical‑flow curl, specular‑reflectance skewness, and spatially‑upsampled rPPG power spectra ‑ directly into the LAA‑X attention computation via cross‑attention gating and a resonance consistency loss. This forces the network to learn manipulation boundaries where semantic inconsistencies and physical violations co‑occur ‑ regions inherently harder for generative models to replicate consistently. PhyLAA‑X is embedded across an efficient spatiotemporal ensemble (EfficientNet‑B4+BiLSTM, ResNeXt‑101+Transformer, Xception+causal Conv1D) with uncertainty‑aware adaptive weighting. On FaceForensics++ (c23), Aletheia reaches 97.2% accuracy / 0.992 AUC‑ROC; on Celeb‑DF v2, 94.9% / 0.981; on DFDC, 90.8% / 0.966 ‑ outperforming the strongest published baseline (LAA‑Net [1]) by 4.1‑7.3% in cross‑generator settings and maintaining 79.4% accuracy under epsilon = 0.02 PGD‑10 attacks. Single‑backbone ablations confirm PhyLAA‑X alone delivers a 4.2% cross‑dataset AUC gain. The full production system is open‑sourced at https://github.com/devghori1264/Aletheia (v1.2, April 2026) with pretrained weights, the adversarial corpus (referred to as ADC‑2026 in this work), and complete reproducibility artifacts.

Authors:Haolong Hu, Hanyu Li, Tiancheng He, Huahui Yi, An Zhang, Qiankun Li, Kun Wang, Yang Liu, Zhigang Zeng
Title: SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics
Abstract:
MLLMs are increasingly deployed in multi‑turn settings, where attackers can escalate unsafe intent through the evolving visual‑text history and exploit long‑context safety decay. Yet safety alignment is still dominated by single‑turn data and fixed‑template dialogues, leaving a mismatch between training and deployment.To bridge this gap, we propose SaFeR‑Steer, a progressive multi‑turn alignment framework that combines staged synthetic bootstrapping with tutor‑in‑the‑loop GRPO to train a single student under adaptive, on‑policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late‑turn failures to earlier turns.I. Dataset. We release STEER, a multi‑turn multimodal safety dataset with STEER‑SFT (12,934), STEER‑RL (2,000), and STEER‑Bench (3,227) dialogues spanning 2~10 turns.II. Experiment. Starting from Qwen2.5‑VL‑3B/7B, SaFeR‑Steer substantially improves Safety/Helpfulness on both single‑turn (48.30/45.86 ‑> 81.84/70.77 for 3B; 56.21/60.32 ‑> 87.89/77.40 for 7B) and multi‑turn benchmarks (12.55/27.13 ‑> 55.58/70.27 for 3B; 24.66/46.48 ‑> 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone.Codes are available at https://github.com/Ed‑Bg/SaFeR‑Steer

Authors:Vladimer Khasia
Title: BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"
Abstract:
The activation memory required for exact backpropagation scales linearly with network depth, context length, and feature dimensionality, forming an O(L BN ) spatial bottleneck (where B is the sequence‑batch cardinality and N is the feature dimension). This constraint historically throttles the scaling of deep neural networks. While randomized automatic differentiation attempts to mitigate this, it historically suffers from catastrophic variance. In this paper, we introduce BASIS (Balanced Activation Sketching with Invariant Scalars), an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions. BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank‑R tensors. To solve the foundational instability of sketched gradients, we propose two novel mechanisms: Balanced Hashing, which strictly eliminates off‑diagonal collision variance, and Invariant Scalars, a principled bias‑variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry. Theoretically, BASIS reduces activation memory to O(L RN ) and heavily decreases the backward pass matrix‑multiplication footprint. Empirically, training a GPT architecture for 50,000 steps validates our theoretical guarantees: at R = 32, BASIS achieves parity with (and marginally outperforms) exact backpropagation validation loss (6.575 vs. 6.616), acting as an implicit regularizer. Remarkably, the stabilized magnitude trajectory allows the model to converge smoothly even under extreme spatial compression (R = 1), proving the extreme robustness of the estimator. The code is available at https://github.com/VladimerKhasia/basis

Authors:Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang, Jocelyn Qiaochu Chen, Greg Durrett, Xi Ye
Title: Detecting and Suppressing Reward Hacking with Gradient Fingerprints
Abstract:
Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward‑hacking behaviors are often implicit, as the intermediate chain‑of‑thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text‑based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model‑generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine‑tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces. Our code is available at: https://github.com/songtao‑x/reward_hack.

Authors:Weijiang Xiong, Robert Fonod, Nikolas Geroliminis
Title: Unveiling Stochasticity: Universal Multi-modal Probabilistic Modeling for Traffic Forecasting
Abstract:
Traffic forecasting is a challenging spatio‑temporal modeling task and a critical component of urban transportation management. Current studies mainly focus on deterministic predictions, with limited considerations on the uncertainty and stochasticity in traffic dynamics. Therefore, this paper proposes an elegant yet universal approach that transforms existing models into probabilistic predictors by replacing only the final output layer with a novel Gaussian Mixture Model (GMM) layer. The modified model requires no changes to the training pipeline and can be trained using only the Negative Log‑Likelihood (NLL) loss, without any auxiliary or regularization terms. Experiments on multiple traffic datasets show that our approach generalizes from classic to modern model architectures while preserving deterministic performance. Furthermore, we propose a systematic evaluation procedure based on cumulative distributions and confidence intervals, and demonstrate that our approach is considerably more accurate and informative than unimodal or deterministic baselines. Finally, a more detailed study on a real‑world dense urban traffic network is presented to examine the impact of data quality on uncertainty quantification and to show the robustness of our approach under imperfect data conditions. Code available at https://github.com/Weijiang‑Xiong/OpenSkyTraffic

Authors:Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang
Title: Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
Abstract:
Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non‑learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets ‑ for instance, boosting GPT‑OSS‑20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real‑world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP

Authors:Zhaobo Hu, Vincent Gauthier, Mehdi Naima
Title: Modern Structure-Aware Simplicial Spatiotemporal Neural Network
Abstract:
Spatiotemporal modeling has evolved beyond simple time series analysis to become fundamental in structural time series analysis. While current research extensively employs graph neural networks (GNNs) for spatial feature extraction with notable success, these networks are limited to capturing only pairwise relationships, despite real‑world networks containing richer topological relationships. Additionally, GNN‑based models face computational challenges that scale with graph complexity, limiting their applicability to large networks. To address these limitations, we present Modern Structure‑Aware Simplicial SpatioTemporal neural network (ModernSASST), the first approach to leverage simplicial complex structures for spatiotemporal modeling. Our method employs spatiotemporal random walks on high‑dimensional simplicial complexes and integrates parallelizable Temporal Convolutional Networks to capture high‑order topological structures while maintaining computational efficiency. Our source code is publicly available on GitHub\footnoteCode is available at: https://github.com/ComplexNetTSP/ST_RUM.

Authors:Oluwaleke Yusuf, M. Tsaqif Wismadi, Adil Rasheed
Title: Similarity-Based Bike Station Expansion via Hybrid Denoising Autoencoders
Abstract:
Urban bike‑sharing systems require strategic station expansion to meet growing demand. Traditional allocation approaches rely on explicit demand modelling that may not capture the urban characteristics distinguishing successful stations. This study addresses the need to exploit patterns from existing stations to inform expansion decisions, particularly in data‑constrained environments. We present a data‑driven framework leveraging existing stations deemed desirable by operational metrics. A hybrid denoising autoencoder (HDAE) learns compressed latent representations from multi‑source grid‑level features (socio‑demographic, built environment, and transport network), with a supervised classification head regularising the embedding space structure. Expansion candidates are selected via greedy allocation with spatial constraints based on latent‑space similarity to existing stations. Evaluation on Trondheim's bike‑sharing network demonstrates that HDAE embeddings yield more spatially coherent clusters and allocation patterns than raw features. Sensitivity analyses across similarity methods and distance metrics confirm robustness. A consensus‑based procedure across multiple parametrisations distils 32 high‑confidence extension zones where all parametrisations agree. The results demonstrate how representation learning captures complex patterns that raw features miss, enabling evidence‑based expansion planning without explicit demand modelling. The consensus procedure strengthens recommendations by requiring agreement across parametrisations, while framework configurability allows planners to incorporate operational knowledge. The methodology generalises to any location‑allocation problem where existing desirable instances inform the selection of new candidates.

Authors:Junguang Yao, Wenye Liu, Stjepan Picek, Yue Zheng
Title: NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition
Abstract:
Visual speaker recognition based on lip motion offers a silent, hands‑free, and behavior‑driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance‑dependent representations, lip motion encodes subject‑specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine‑grained dynamics is challenging for conventional frame‑based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event‑based framework that captures fine‑grained lip dynamics under a strict yet practical cross‑scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features a 1) Temporal‑aware Voxel Encoding module with adaptive event weighting, 2) Structure‑aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) Polarity Consistency Regularization mechanism to retain motion‑direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event‑based lip‑motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near‑perfect matched‑scene accuracy and robust cross‑scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low‑light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at https://github.com/JiuZeongit/NeuroLip.

Authors:Jon-Paul Cacioli
Title: The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
Abstract:
We introduce a cross‑domain behavioural assay of monitoring‑control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1‑T5 were pre‑registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced‑choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson‑Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Retrospective monitoring and prospective regulation appear dissociable (r = .17, 95% CI wide given n=20; exemplar‑based evidence is the primary support). Scaling on metacognitive calibration is architecture‑dependent: monotonically decreasing (Qwen), monotonically increasing (GPT‑5.4), or flat (Gemma). Behavioural findings converge structurally with an independent Type‑2 SDT approach, providing preliminary cross‑method construct validity. All items, data, and code: https://github.com/synthiumjp/metacognitive‑monitoring‑battery.

Authors:Xinge Liu, Terry Jingchen Zhang, Bernhard Schölkopf, Zhijing Jin, Kristen Menou
Title: Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Abstract:
The rise of autonomous AI agents suggests that dynamic benchmark environments with built‑in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics‑grounded model‑fitting tasks using inference on radial‑velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high‑SNR single‑planet systems to complex multi‑planetary configurations requiring involved low‑SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test‑time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model‑fitting problem of practical research relevance today. Our methodology to design a simulation‑driven environment for AI agents presumably generalizes to many other model‑fitting problems across scientific domains. Source code and the project website are available at https://github.com/Gudmorning2025/Stargazer and https://gudmorning2025.github.io/Stargazer, respectively.

Authors:Rafael T. Sereicikas, Pedro R. Pires, Gregorio F. Azevedo, Tiago A. Almeida
Title: Learning Behaviorally Grounded Item Embeddings via Personalized Temporal Contexts
Abstract:
Effective user modeling requires distinguishing between short‑term and long‑term preference evolution. While item embeddings have become a key component of recommender systems, standard approaches like Item2Vec treat user histories as unordered sets (bag‑of‑items), implicitly assuming that interactions separated by minutes are as semantically related as those separated by months. This simplification flattens the rich temporal structure of user behavior, obscuring the distinction between coherent consumption sessions and gradual interest drifts. In this work, we introduce TAI2Vec (Time‑Aware Item‑to‑Vector), a family of lightweight embedding models that integrates temporal proximity directly into the representation learning process. Unlike approaches that apply global time constraints, TAI2Vec is user‑adaptive, tailoring its temporal definitions to individual interaction paces. We propose two complementary strategies: TAI2Vec‑Disc, which utilizes personalized anomaly detection to dynamically segment interactions into semantic sessions, and TAI2Vec‑Cont, which employs continuous, user‑specific decay functions to weigh item relationships based on their relative temporal distance. Experimental results across eight diverse datasets demonstrate that TAI2Vec consistently produces more accurate and behaviorally grounded representations than static baselines, achieving competitive or superior performance in over 80% of the datasets, with improvements of up to 135%. The source code is publicly available at https://github.com/UFSCar‑LaSID/tai2vec.

Authors:Pedro R. Pires, Rafael T. Sereicikas, Gregorio F. Azevedo, Tiago A. Almeida
Title: Collaborative Filtering Through Weighted Similarities of User and Item Embeddings
Abstract:
In recent years, neural networks and other complex models have dominated recommender systems, often setting new benchmarks for state‑of‑the‑art performance. Yet, despite these advancements, award‑winning research has demonstrated that traditional matrix factorization methods can remain competitive, offering simplicity and reduced computational overhead. Hybrid models, which combine matrix factorization with newer techniques, are increasingly employed to harness the strengths of multiple approaches. This paper proposes a novel ensemble method that unifies user‑item and item‑item recommendations through a weighted similarity framework to deliver top‑N recommendations. Our approach is distinctive in its use of shared user and item embeddings for both recommendation strategies, simplifying the architecture and enhancing computational efficiency. Extensive experiments across multiple datasets show that our method achieves competitive performance and is robust in varying scenarios that favor either user‑item or item‑item recommendations. Additionally, by eliminating the need for embedding‑specific fine‑tuning, our model allows for the seamless reuse of hyperparameters from the base algorithm without sacrificing performance. This results in a method that is both efficient and easy to implement. Our open‑source implementation is available at https://github.com/UFSCar‑LaSID/weighted‑sims‑recommender.

Authors:Mohammad Mahdi Abootorabi, Parvin Mousavi, Purang Abolmaesumi, Evan Shelhamer
Title: ProtoTTA: Prototype-Guided Test-Time Adaptation
Abstract:
Deep networks that rely on prototypes‑interpretable representations that can be related to the model input‑have gained significant attention for balancing high accuracy with inherent interpretability, which makes them suitable for critical domains such as healthcare. However, these models are limited by their reliance on training data, which hampers their robustness to distribution shifts. While test‑time adaptation (TTA) improves the robustness of deep networks by updating parameters and statistics, the prototypes of interpretable models have not been explored for this purpose. We introduce ProtoTTA, a general framework for prototypical models that leverages intermediate prototype signals rather than relying solely on model outputs. ProtoTTA minimizes the entropy of the prototype‑similarity distribution to encourage more confident and prototype‑specific activations on shifted data. To maintain stability, we employ geometric filtering to restrict updates to samples with reliable prototype activations, regularized by prototype‑importance weights and model‑confidence scores. Experiments across four prototypical backbones on four diverse benchmarks spanning fine‑grained vision, histopathology, and NLP demonstrate that ProtoTTA improves robustness over standard output entropy minimization while restoring correct semantic focus in prototype activations. We also introduce novel interpretability metrics and a vision‑language model (VLM) evaluation framework to explain TTA dynamics, confirming ProtoTTA restores human‑aligned semantic focus and correlates reliably with VLM‑rated reasoning quality. Code is available at: https://github.com/DeepRCL/ProtoTTA.

Authors:Zixuan Weng, Jinghuai Zhang, Kunlin Cai, Ying Li, Peiran Wang, Yuan Tian
Title: FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models
Abstract:
Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference‑time steering offers a cost‑effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility‑preserving, and training‑efficient due to their rigid, one‑size‑fits‑all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference‑time steering into two complementary stages: conditional steering and fine‑grained vector synthesis, allowing fine‑grained control over when and how to steer internal representations. In the first stage, we introduce a Subspace‑guided Conditional Steering (SCS) mechanism that preserves model utility by avoiding unnecessary steering. In the second stage, we propose a Mixture‑of‑Steering‑Experts (MoSE) mechanism that captures the multimodal nature of desired steering behaviors and generates query‑specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training‑efficient manner. Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state‑of‑the‑art methods in overall performance, achieving stronger steering performance with minimal utility loss. Code is available at https://github.com/YukinoAsuna/FineSteer

Authors:G. Aytug Akarlar
Title: Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
Abstract:
We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same‑prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt‑level confounds. On Qwen2.5‑1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random‑patch control. Window patching shows correction requires sustained multi‑step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step‑0 residual states predict per‑prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000‑permutation null); unsupervised clustering identifies five regime‑like groups (eta^2 = 0.55) whose saddle‑adjacent cluster concentrates 12 of the 13 bifurcating false‑premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.

Authors:Sanjeev Panta, Rhett M Morvant, Xu Yuan, Li Chen, Nian-Feng Tzeng
Title: M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention
Abstract:
Accurate and timely rainfall nowcasting is crucial for disaster mitigation and water resource management. Despite recent advances in deep learning, precipitation prediction remains challenging due to limitations in effectively leveraging diverse multimedia data sources. We introduce M3R, a Meteorology‑informed MultiModal attention‑based architecture for direct Rainfall prediction that synergistically combines visual NEXRAD radar imagery with numerical Personal Weather Station (PWS) measurements, using a comprehensive pipeline for temporal alignment of heterogeneous meteorological data. With specialized multimodal attention mechanisms, M3R novelly leverages weather station time series as queries to selectively attend to spatial radar features, enabling focused extraction of precipitation signatures. Experimental results for three spatial areas of 100 km 100 km centered at NEXRAD radar stations demonstrate that M3R outperforms existing approaches, achieving substantial improvements in accuracy, efficiency, and precipitation detection capabilities. Our work establishes new benchmarks for multimedia‑based precipitation nowcasting and provides practical tools for operational weather prediction systems. The source code is available at https://github.com/Sanjeev97/M3Rain

Authors:Kieran A. Murphy
Title: InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control
Abstract:
We propose InfoChess, a symmetric adversarial game that elevates competitive information acquisition to the primary objective. There is no piece capture, removing material incentives that would otherwise confound the role of information. Instead, pieces are used to alter visibility. Players are scored on their probabilistic inference of the opponent's king location over the duration of the game. To explore the space of strategies for playing InfoChess, we introduce a hierarchy of heuristic agents defined by increasing levels of opponent modeling, and train a reinforcement learning agent that outperforms these baselines. Leveraging the discrete structure of the game, we analyze gameplay through natural information‑theoretic characterizations that include belief entropy, oracle cross entropy, and predictive log score under the action‑induced observation channel. These measures disentangle epistemic uncertainty, calibration mismatch, and uncertainty induced by adversarial movement. The design of InfoChess renders it a testbed for studying multi‑agent inference under partial observability. We release code for the environment and agents, and a public interface to encourage further study.

Authors:Yury Gorishniy, Ivan Rubachev, Dmitrii Feoktistov, Artem Babenko
Title: Benchmarking Optimizers for MLPs in Tabular Deep Learning
Abstract:
MLP is a heavily used backbone in modern deep learning (DL) architectures for supervised learning on tabular data, and AdamW is the go‑to optimizer used to train tabular DL models. Unlike architecture design, however, the choice of optimizer for tabular DL has not been examined systematically, despite new optimizers showing promise in other domains. To fill this gap, we benchmark 15 optimizers on 17 tabular datasets for training MLP‑based models in the standard supervised learning setting under a shared experiment protocol. Our main finding is that the Muon optimizer consistently outperforms AdamW, and thus should be considered a strong and practical choice for practitioners and researchers, if the associated training efficiency overhead is affordable. Additionally, we find exponential moving average of model weights to be a simple yet effective technique that improves AdamW on vanilla MLPs, though its effect is less consistent across model variants.

Authors:Tianhao Fu, Austin Wang, Charles Chen, Roby Aldave-Garza, Yucheng Chen
Title: SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation
Abstract:
Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single‑forward‑pass alternatives often provide weaker failure ranking or rely on restrictive feature‑space assumptions. We present SegWithU, a post‑hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank‑1 posterior probes. It produces two voxel‑wise uncertainty maps: a calibration‑oriented map for probability tempering and a ranking‑oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single‑forward‑pass baseline, achieving AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. These results suggest that perturbation‑based uncertainty modeling is an effective and practical route to reliability‑aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.

Authors:Onno Niemann, Gonzalo Martínez Muñoz, Alberto Suárez Gonzalez
Title: An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation
Abstract:
Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker‑‑Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.

Authors:Haochun Tang, Yuliang Yan, Jiahua Lu, Huaxiao Liu, Enyan Dai
Title: Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization
Abstract:
Cost‑aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high‑capability models. Existing routing attacks depend on either white‑box access or heuristic prompts, rendering them ineffective in real‑world black‑box scenarios. In this work, we propose R^2A, which aims to mislead black‑box LLM routers to expensive models via adversarial suffix optimization. Specifically, R^2A deploys a hybrid ensemble surrogate router to mimic the black‑box router. A suffix optimization algorithm is further adapted for the ensemble‑based surrogate. Extensive experiments on multiple open‑source and commercial routing systems demonstrate that R^2A significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: https://github.com/thcxiker/R2A‑Attack.

Authors:Gökçe Uludoğan, Buse Giledereli, Elif Ozkirimli, Arzucan Özgür
Title: PUFFIN: Protein Unit Discovery with Functional Supervision
Abstract:
Proteins carry out biological functions through the coordinated action of groups of residues organized into structural arrangements. These arrangements, which we refer to as protein units, exist at an intermediate scale, being larger than individual residues yet smaller than entire proteins. A deeper understanding of protein function can be achieved by identifying these units and their associations with function. However, existing approaches either focus on residue‑level signals, rely on curated annotations, or segment protein structures without incorporating functional information, thereby limiting interpretable analysis of structure‑function relationships. We introduce PUFFIN, a data‑driven framework for discovering protein units by jointly learning structural partitioning and functional supervision. PUFFIN represents proteins as residue‑level structure graphs and applies a graph neural network with a structure‑aware pooling mechanism that partitions each protein into multi‑residue units, with functional supervision that shapes the partition. We show that the learned units are structurally coherent, exhibit organized associations with molecular function, and show meaningful correspondence with curated InterPro annotations. Together, these results demonstrate that PUFFIN provides an interpretable framework for analyzing structure‑function relationships using learned protein units and their statistical function associations. We made our source code available at https://github.com/boun‑tabi‑lifelu/puffin.

Authors:Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, Donglin Wang
Title: World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
Abstract:
Vision‑Language‑Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long‑horizon trajectories and evaluate their consequences, which limits performance in complex decision‑making tasks. In this work, we introduce World‑Value‑Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, WAV model learn a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long‑horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high‑value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent‑space inference reshapes the search distribution toward feasible regions, enabling efficient long‑horizon decision making. Extensive simulations and real‑world experiments demonstrate that the WAV model consistently outperforms state‑of‑the‑art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long‑horizon and compositional scenarios. Code is available at https://github.com/Win‑commit/WAV.

Authors:Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng
Title: Mean Flow Policy Optimization
Abstract:
Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few‑step flow‑based generative models, to improve training and inference efficiency over diffusion‑based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion‑based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.

Authors:Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons
Title: Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring
Abstract:
Clinical value set authoring ‑‑ the task of identifying all codes in a standardized vocabulary that define a clinical concept ‑‑ is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version‑controlled, and not reliably memorized during pretraining. We propose Retrieval‑Augmented Set Completion (RASC): retrieve the K most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve‑and‑select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large‑scale benchmark for this task. A cross‑encoder fine‑tuned on SAPBert achieves AUROC~0.852 and value‑set‑level F1~0.298, outperforming a simpler three‑layer Multilayer Perceptron (AUROC~0.799, F1~0.250) and both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval‑only) to approximately 3.2 and 4.4 respectively. Zero‑shot GPT‑4o achieves value‑set‑level F1~0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross‑encoder initialized from pre‑trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code is available at: \hrefhttps://github.com/mukhes3/RASChttps://github.com/mukhes3/RASC.

Authors:Mohammad R. Abu Ayyash
Title: Three-Phase Transformer
Abstract:
We present Three‑Phase Transformer (3PT), a residual‑stream structural prior for decoder‑only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally‑sized cyclic channels, each maintained by phase‑respecting ops: a per‑channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i(2pi/N), and a head‑count constraint aligning GQA heads with the partition. The architecture is a self‑stabilizing equilibrium between scrambling and re‑imposition, not a bolted‑on module. The partition carves out a one‑dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute‑position side‑channel composing orthogonally with RoPE's relative‑position rotation. The canonical N=3 borrows its metaphor from balanced three‑phase AC, where three sinusoids 120 degrees apart sum to zero with no anti‑correlated pair. At 123M parameters on WikiText‑103, 3PT achieves ‑7.20% perplexity (‑2.62% bits‑per‑byte) over a matched RoPE‑Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step‑count convergence speedup (1.64x wall‑clock). N behaves as a parameter‑sharing knob rather than a unique optimum: at 5.5M an N‑sweep over 1,2,3,4,6,8,12 is near‑monotone with N=1 winning; at 123M a three‑seed sweep finds N=3 and N=1 statistically indistinguishable. The load‑bearing mechanism is the channel‑partitioned residual stream, per‑block rotation, per‑phase normalization, and horn DC injection. We characterize (a) self‑stabilization of the geometry without explicit enforcement, a novel instance of the conservation‑law framework for neural networks; (b) a U‑shaped depth profile of rotation‑angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.

Authors:Avinash Amudala
Title: PROXIMA: A Reliability Scoring Framework for Proxy Metrics in Online Controlled Experiments
Abstract:
Online A/B testing at scale relies on proxy metrics ‑‑ short‑term, easily‑measured signals used in place of slow‑moving long‑term outcomes. When the proxy‑outcome relationship is heterogeneous across user segments, aggregate correlation can mask directional failures akin to Simpson's Paradox, leading to costly ship/no‑ship errors. We introduce PROXIMA (Proxy Metric Validation Framework for Online Experiments), a lightweight diagnostic framework that scores proxy reliability through a composite of three complementary dimensions: normalised effect correlation, directional accuracy, and segment‑level fragility rate. Unlike surrogate‑index approaches that predict long‑term treatment effects, PROXIMA directly audits whether a candidate proxy leads to correct launch decisions and flags the user segments where it fails. We validate PROXIMA on two public datasets ‑‑ the Criteo Uplift corpus (14M observations, advertising) and KuaiRec (7K users, video recommendation) ‑‑ using 80 simulated A/B tests. Early engagement metrics achieve a composite reliability of 0.80 on Criteo and 0.62 on KuaiRec, yielding 98.4% average decision agreement with an oracle policy. Fragility analysis reveals that recommendation domains exhibit substantially higher segment‑level heterogeneity (68% fragility) than advertising (13%), yet directional accuracy remains above 96% in both cases. A sensitivity analysis over the weight space confirms that no single component suffices and that the composite provides substantially better discrimination between reliable and unreliable proxies than correlation alone. Code and reproduction scripts are available at: https://github.com/Avinash‑Amudala/PROXIMA

Authors:Guillermo Valverde, Igor García-Olaizola, Giannicola Scarpa, Alejandro Pozas-Kerstjens
Title: Quantum-inspired tensor networks in machine learning models
Abstract:
Tensor networks were developed in the context of many‑body physics as compressed representations of multiparticle quantum states. These representations mitigate the exponential complexity of many‑body systems by capturing only the most relevant dependencies. Due to the formal similarity between quantum entanglement and statistical correlations, tensor networks have recently been integrated in machine learning, operating both as alternative learning architectures and as decompositions of components of neural networks. The expectation is that the theoretical understanding of tensor networks developed within quantum many‑body physics leads to novel methods that offer advantages in terms of computational efficiency, explainability, or privacy. Here we review the use of tensor networks in the context of machine learning, providing a critical assessment of the state of the art, the potential advantages, and the challenges that must be overcome.

Authors:Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang
Title: Reinforcement Learning via Value Gradient Flow
Abstract:
We study behavior‑regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over‑optimization caused by erroneous out‑of‑distribution extrapolation. Existing methods either rely on reparameterized policy gradient, which are difficult to scale to large generative models, or on reject sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior‑regularized RL. VGF casts behavior‑regularized RL as an optimal transport problem that maps the reference distribution to the value‑induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, this enables adaptive test‑time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state‑of‑the‑art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at https://ryanxhr.github.io/vgf.

Authors:Qianyu Chen, Shujian Yu
Title: Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay
Abstract:
Functional magnetic resonance imaging (fMRI) is widely used for studying and diagnosing brain disorders, with functional connectivity (FC) matrices providing powerful representations of large‑scale neural interactions. However, existing diagnostic models are trained either on a single site or under full multi‑site access, making them unsuitable for real‑world scenarios where clinical data arrive sequentially from different institutions. This results in limited generalization and severe catastrophic forgetting. This paper presents the first continual learning framework specifically designed for fMRI‑based diagnosis across heterogeneous clinical sites. Our framework introduces a structure‑aware variational autoencoder that synthesizes realistic FC matrices for both patient and control groups. Built on this generative backbone, we develop a multi‑level knowledge distillation strategy that aligns predictions and graph representations between new‑site data and replayed samples. To further enhance efficiency, we incorporate a hierarchical contextual bandit scheme for adaptive replay sampling. Experiments on multi‑site datasets for major depressive disorder (MDD), schizophrenia (SZ), and autism spectrum disorder (ASD) show that the proposed generative model enhances data augmentation quality, and the overall continual learning framework substantially outperforms existing methods in mitigating catastrophic forgetting. Our code is available at https://github.com/4me808/FORGE.

Authors:Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen
Title: Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
Abstract:
Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open‑source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while‑loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML‑based classifier, a five‑layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append‑oriented session storage. A comparison with OpenClaw, a multi‑channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per‑action safety classification to perimeter‑level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context‑window extensions to gateway‑wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.

Authors:Pu Cheng, Juncheng Liu, Yunshen Long
Title: PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
Abstract:
Predicting real‑world events from live market signals demands systems that fuse qualitative news with quantitative order‑book dynamics under strict temporal discipline ‑‑ a challenge existing benchmarks fail to capture. We present PolyBench, a multimodal benchmark derived from Polymarket that records point‑in‑time cross‑sections of 38,666 binary prediction markets spanning 4,997 events, synchronously coupling each snapshot with a Central Limit Order Book (CLOB) state and a real‑time news stream. Using PolyBench, we evaluate seven state‑of‑the‑art Large Language Models ‑‑ spanning open‑ and closed‑source families ‑‑ generating 36,165 predictions under identical, timestamp‑locked market states collected between February 6 and 12, 2026. Our multidimensional framework assesses directional accuracy, our proposed Confidence‑Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio via realistic order‑book execution simulation. The results reveal a pronounced performance divergence: only two of seven models achieve positive financial returns ‑‑ MiMo‑V2‑Flash at 17.6% CWR and Gemini‑3‑Flash at 6.2% CWR ‑‑ while the remaining five incur losses despite uniformly high stated confidence. These findings highlight the gap between surface‑level language fluency and genuine probabilistic reasoning under live market uncertainty, and establish PolyBench as a contamination‑proof, financially‑grounded evaluation standard for future LLM research. Our dataset and code available at \underline\hrefhttps://github.com/PolyBench/PolyBenchhttps://github.com/PolyBench/PolyBench.

Authors:Haiyang Zheng, Nan Pu, Yaqi Cai, Teng Long, Wenjing Li, Nicu Sebe, Zhun Zhong
Title: The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
Abstract:
Generalized Category Discovery (GCD) leverages labeled data to categorize unlabeled samples from known or unknown classes. Most previous methods jointly optimize supervised and unsupervised objectives and achieve promising results. However, inherent optimization interference still limits their ability to improve further. Through quantitative analysis, we identify a key issue, i.e., gradient entanglement, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation‑subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy‑Aware Gradient Coordinator (EAGC), a plug‑and‑play gradient‑level module that explicitly regulates the optimization process. EAGC comprises two components: Anchor‑based Gradient Alignment (AGA) and Energy‑aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, preserving the discriminative structure of known classes against the interference of unlabeled gradients. EEP softly projects unlabeled gradients onto the complement of the known‑class subspace and derives an energy‑based coefficient to adaptively scale the projection for each unlabeled sample according to its degree of alignment with the known subspace, thereby reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes. Experiments show that EAGC consistently boosts existing methods and establishes new state‑of‑the‑art results. Code is available at https://haiyangzheng.github.io/EAGC.

Authors:Bryan Sanchez
Title: Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
Abstract:
Alignment‑tuned language models frequently suppress factual log‑probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K‑parameter (approximately 0.02% of the base model) post‑transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology‑discriminating facts across Qwen3‑4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11‑‑39% of 16 held‑out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log‑probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last‑position‑only), the adapter produces coherent, less censored text. A logit‑space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden‑state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(model.parameters()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.

Authors:Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
Title: From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
Abstract:
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre‑train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre‑training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre‑train Space RL), which applies reward‑driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR‑PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR‑PreRL to expand the reasoning horizon before transitioning to standard RL for fine‑grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre‑train space pruning effectively steers the policy toward a refined correct reasoning subspace.

Authors:Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov
Title: From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
Abstract:
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real‑world usefulness. Instead, users often rely on ``vibe‑testing'': informal experience‑based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe‑testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe‑testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in‑the‑wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe‑testing as a two‑part process: users personalize both what they test and how they judge responses. We then introduce a proof‑of‑concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user‑aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user‑aware evaluation can change which model is preferred, reflecting the role of vibe‑testing in practice. These findings suggest that formalized vibe‑testing can serve as a useful approach for bridging benchmark scores and real‑world experience.

Authors:Xiaofan Zhou, Kyumin Lee
Title: ID and Graph View Contrastive Learning with Multi-View Attention Fusion for Sequential Recommendation
Abstract:
Sequential recommendation has become increasingly prominent in both academia and industry, particularly in e‑commerce. The primary goal is to extract user preferences from historical interaction sequences and predict items a user is likely to engage with next. Recent advances have leveraged contrastive learning and graph neural networks to learn more expressive representations from interaction histories ‑‑ graphs capture relational structure between nodes, while ID‑based representations encode item‑specific information. However, few studies have explored multi‑view contrastive learning between ID and graph perspectives to jointly improve user and item representations, especially in settings where only interaction data is available without auxiliary information. To address this gap, we propose Multi‑View Contrastive learning for sequential recommendation (MVCrec), a framework that integrates complementary signals from both sequential (ID‑based) and graph‑based views. MVCrec incorporates three contrastive objectives: within the sequential view, within the graph view, and across views. To effectively fuse the learned representations, we introduce a multi‑view attention fusion module that combines global and local attention mechanisms to estimate the likelihood of a target user purchasing a target item. Comprehensive experiments on five real‑world benchmark datasets demonstrate that MVCrec consistently outperforms 11 state‑of‑the‑art baselines, achieving improvements of up to 14.44% in NDCG@10 and 9.22% in HitRatio@10 over the strongest baseline. Our code and datasets are available at https://github.com/sword‑Lz/MMCrec.

Authors:Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
Title: TIP: Token Importance in On-Policy Distillation
Abstract:
On‑policy knowledge distillation (OPD) trains a student on its own rollouts under token‑level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher‑‑student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first‑order proxy: retaining 50% of tokens with entropy‑based sampling matches or exceeds all‑token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low‑entropy, high‑divergence tokens, training on fewer than 10% of all tokens nearly matches full‑token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy‑only rules. We organize these findings with TIP (Token Importance in on‑Policy distillation), a two‑axis taxonomy over student entropy and teacher‑‑student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type‑aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher‑‑student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH‑500 and AIME 2024/2025, and on the DeepPlanning benchmark for long‑horizon agentic planning, where Q3‑only training on <20% of tokens surpasses full‑token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory‑efficient distillation of larger models under limited GPU budgets.

Authors:Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner
Title: Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation
Abstract:
Sparse mixture‑of‑experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed‑forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine‑grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch‑wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder‑decoder and backbone‑based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture‑dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN‑based dense prediction. Our code is available at https://github.com/KASTEL‑MobilityLab/moe‑layers/.

Authors:Boxuan Jiang, Chenyun Dai, Can Han
Title: EMGFlow: Robust and Efficient Surface Electromyography Synthesis via Flow Matching
Abstract:
Deep learning‑based surface electromyography (sEMG) gesture recognition is frequently bottlenecked by data scarcity and limited subject diversity. While synthetic data generation via Generative Adversarial Networks (GANs) and diffusion models has emerged as a promising augmentation strategy, these approaches often face challenges regarding training stability or inference efficiency. To bridge this gap, we propose EMGFlow, a conditional sEMG generation framework. To the best of our knowledge, this is the first study to investigate the application of Flow Matching (FM) and continuous‑time generative modeling in the sEMG domain. To validate EMGFlow across three benchmark sEMG datasets, we employ a unified evaluation protocol integrating feature‑based fidelity, distributional geometry, and downstream utility. Extensive evaluations show that EMGFlow outperforms conventional augmentation and GAN baselines, and provides stronger standalone utility than the diffusion baselines considered here under the train‑on‑synthetic test‑on‑real (TSTR) protocol. Furthermore, by optimizing generation dynamics through advanced numerical solvers and targeted time sampling, EMGFlow achieves improved quality‑efficiency trade‑offs. Taken together, these results suggest that Flow Matching is a promising and efficient paradigm for addressing data bottlenecks in myoelectric control systems. Our code is available at: https://github.com/Open‑EXG/EMGFlow.

Authors:Zijian Zhao, Jing Gao, Sen Li
Title: Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
Abstract:
Cooperative multi‑agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non‑stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi‑Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single‑agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision‑making mechanism in which a Transformer decoder autoregressively generates a high‑level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order‑independent joint decision making and avoiding the sensitivity to action‑generation order in conventional Multi‑Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single‑agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi‑Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at:https://github.com/RS2002/CMAT .

Authors:Mohammed Ezzaldin Babiker Abdullah
Title: Asymmetric-Loss-Guided Hybrid CNN-BiLSTM-Attention Model for Industrial RUL Prediction with Interpretable Failure Heatmaps
Abstract:
Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi‑sensor spatial correlations and long‑range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety‑critical error of over‑estimating residual life. This study proposes a hybrid architecture integrating Twin‑Stage One‑Dimensional Convolutional Neural Networks (1D‑CNN), a Bidirectional Long Short‑Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero‑Propulsion System Simulation (C‑MAPSS) FD001 sub‑dataset employing a zero‑leakage preprocessing pipeline, piecewise‑linear RUL labeling capped at 130 cycles, and the NASA‑specified asymmetric exponential loss function that disproportionately penalizes over‑estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S‑Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per‑engine insights into the temporal progression of degradation, supporting informed maintenance decision‑making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.

Authors:Mohammed Ezzaldin Babiker Abdullah, Rufaidah Abdallah Ibrahim Mohammed
Title: Outperforming Self-Attention Mechanisms in Solar Irradiance Forecasting via Physics-Guided Neural Networks
Abstract:
Accurate Global Horizontal Irradiance (GHI) forecasting is critical for grid stability, particularly in arid regions characterized by rapid aerosol fluctuations. While recent trends favor computationally expensive Transformer‑based architectures, this paper challenges the prevailing "complexity‑first" paradigm. We propose a lightweight, Physics‑Informed Hybrid CNN‑BiLSTM framework that prioritizes domain knowledge over architectural depth. The model integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Bi‑Directional LSTM for capturing temporal dependencies. Unlike standard data‑driven approaches, our model is explicitly guided by a vector of 15 engineered features including Clear‑Sky indices and Solar Zenith Angle ‑ rather than relying solely on raw historical data. Hyperparameters are rigorously tuned using Bayesian Optimization to ensure global optimality. Experimental validation using NASA POWER data in Sudan demonstrates that our physics‑guided approach achieves a Root Mean Square Error (RMSE) of 19.53 W/m^2, significantly outperforming complex attention‑based baselines (RMSE 30.64 W/m^2). These results confirm a "Complexity Paradox": in high‑noise meteorological tasks, explicit physical constraints offer a more efficient and accurate alternative to self‑attention mechanisms. The findings advocate for a shift towards hybrid, physics‑aware AI for real‑time renewable energy management.

Authors:Jason Kong, Nilesh Prasad Pandey, Flavio Ponzina, Tajana Rosing
Title: A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models
Abstract:
Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real‑time processing and on‑device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer‑based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation‑free, surrogate‑based sensitivity analysis framework to identify hybrid SSM‑Transformer components most susceptible to quantization‑induced degradation. Relying solely on forward‑pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in‑domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback‑Leibler (KL) divergence metric better captures quantization sensitivity for Language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal‑to‑quantization‑noise ratio (SQNR). Through extensive experiments on SSM and hybrid architectures, our ablation studies confirm that KL‑based rankings align with observed performance drops and outperform alternative metrics. This framework enables the practical deployment of advanced hybrid models on resource‑constrained edge devices with minimal accuracy loss. We further validate our approach with real‑world on‑device profiling on Intel Lunar Lake hardware, demonstrating that KL‑guided mixed‑precision achieves near‑FP16 perplexity with model sizes and throughput competitive with Uniform INT4 on both CPU and GPU execution modes. Code is available at https://github.com/jasonkongie/kl‑ssm‑quant.

Authors:Yiping Li, Zhiyu An, Wan Du
Title: When Less Latent Leads to Better Relay: Information-Preserving Compression for Latent Multi-Agent LLM Collaboration
Abstract:
Communication in Large Language Model (LLM)‑based multi‑agent systems is moving beyond discrete tokens to preserve richer context. Recent work such as LatentMAS enables agents to exchange latent messages through full key‑value (KV) caches. However, full KV relay incurs high memory and communication cost. We adapt eviction‑style KV compression to this setting and introduce Orthogonal Backfill (OBF) to mitigate information loss from hard eviction. OBF injects a low‑rank orthogonal residual from discarded KV states into the retained KV states. We evaluate proposed method against full KV relay on nine standard benchmarks spanning mathematical reasoning, coding, and knowledge‑intensive QA. It achieves performance comparable to full KV relay while reducing communication cost by 79.8%‑‑89.4%. OBF further improves the performance and achieves the best results on 7 of the 9 benchmarks. This suggests that more information does not necessarily lead to better communication; preserving the most useful information matters more. Our codebase is publicly available on https://github.com/markli404/When‑Less‑Latent‑Leads‑to‑Better‑Relay.

Authors:Yilang Zhang, Abraham Jaeger Mountain, Bingcong Li, Georgios B. Giannakis
Title: Binomial Gradient-Based Meta-Learning for Enhanced Meta-Gradient Estimation
Abstract:
Meta‑learning offers a principled framework leveraging \emphtask‑invariant priors from related tasks, with which \emphtask‑specific models can be fine‑tuned on downstream tasks, even with limited data records. Gradient‑based meta‑learning (GBML) relies on gradient descent (GD) to adapt the prior to a new task. Albeit effective, these methods incur high computational overhead that scales linearly with the number of GD steps. To enhance efficiency and scalability, existing methods approximate the gradient of prior parameters (meta‑gradient) via truncated backpropagation, yet suffer large approximation errors. Targeting accurate approximation, this work puts forth binomial GBML (BinomGBML), which relies on a truncated binomial expansion for meta‑gradient estimation. This novel expansion endows more information in the meta‑gradient estimation via efficient parallel computation. As a running paradigm applied to model‑agnostic meta‑learning (MAML), the resultant BinomMAML provably enjoys error bounds that not only improve upon existing approaches, but also decay super‑exponentially under mild conditions. Numerical tests corroborate the theoretical analysis and showcase boosted performance with slightly increased computational overhead.

Authors:Bhavana Sajja
Title: Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal, Velocity, and Multi-Account Signals
Abstract:
We introduce behavioral fidelity ‑‑ a third evaluation dimension for synthetic tabular data that measures whether generated data preserves the temporal, sequential, and structural behavioral patterns that distinguish real‑world entity activity. Existing frameworks evaluate statistical fidelity (marginal distributions and correlations) and downstream utility (classifier AUROC on synthetic‑trained models), but neither tests for the behavioral signals that operational detection and analysis systems actually rely on. We formalize a taxonomy of four behavioral fraud patterns (P1‑P4) covering inter‑event timing, burst structure, multi‑account graph motifs, and velocity‑rule trigger rates; define a degradation ratio metric calibrated to a real‑data noise floor (1.0 = matches real variability, k = k‑times worse); and prove that row‑independent generators ‑‑ the dominant paradigm ‑‑ are structurally incapable of reproducing P3 graph motifs (Proposition 1) and produce non‑positive within‑entity IET autocorrelation (Proposition 2), making the positive burst fingerprint of fraud sequences unachievable regardless of architecture or training data size. We benchmark CTGAN, TVAE, GaussianCopula, and TabularARGN on IEEE‑CIS Fraud Detection and the Amazon Fraud Dataset. All four fail severely: on IEEE‑CIS composite degradation ratios range from 24.4x (TVAE) to 39.0x (GaussianCopula); on Amazon FDB, row‑independent generators score 81.6‑99.7x, while TabularARGN achieves 17.2x. We document generator‑specific failure modes and their resolutions. The P1‑P4 framework extends to any domain with entity‑level sequential tabular data, including healthcare and network security. We release our evaluation framework as open source.

Authors:Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang
Title: LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
Abstract:
LLM‑based agents are increasingly expected to handle real‑world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real‑world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple‑Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity‑factor annotations, covering real‑world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi‑AI/LiveClawBench.

Authors:Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
Title: Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Abstract:
On‑policy distillation (OPD) has become a core technique in the post‑training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak‑to‑strong reverse distillation, showing that same‑family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token‑level mechanism, we show that successful OPD is characterized by progressive alignment on high‑probability tokens at student‑visited states, a small shared token set that concentrates most of the probability mass (97%‑99%). We further propose two practical strategies to recover failing OPD: off‑policy cold start and teacher‑aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token‑level reward comes at a cost, raising the question of whether OPD can scale to long‑horizon distillation.

Authors:Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu, Yabiao Wang, Yong Liu, Shuicheng Yan
Title: Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
Abstract:
Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First‑order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large‑scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second‑order optimization techniques to surpass first‑order performance ceilings, while zeroth‑order methods reemerge to alleviate memory constraints inherent to large‑scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade‑offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next‑generation highly efficient, robust, and trustworthy optimization methods. The code is available at https://github.com/APRIL‑AIGC/Awesome‑Optimizer.

Authors:Jason Z Wang
Title: The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime
Abstract:
The most cited calibration result in deep learning ‑‑ post‑temperature‑scaling ECE of 0.012 on CIFAR‑100 (Guo et al., 2017) ‑‑ is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate epsilon is Theta((Lepsilon/m)^1/3), and no estimator can beat it. This "verification tax" implies that as AI models improve, verifying their calibration becomes fundamentally harder ‑‑ with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self‑evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at mepsilon approx 1 below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate L^K. We validate across five benchmarks (MMLU, TruthfulQA, ARC‑Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B‑405B parameters, 27 benchmark‑model pairs with logprob‑based confidence), 95% bootstrap CIs, and permutation tests. Self‑evaluation non‑significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.

Authors:Shaopeng Fu, Di Wang
Title: Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
Abstract:
Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in‑context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in‑context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real‑world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness‑utility tradeoff. The code is available at https://github.com/fshp971/continuous‑adv‑icl.

Authors:Arya Shah, Kaveri Visavadiya, Manisha Padala
Title: GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees
Abstract:
Adversarial robustness is essential for deploying neural networks in safety‑critical applications, yet standard evaluation methods either require expensive adversarial attacks or report only a single aggregate score that obscures how robustness is distributed across classes. We introduce the \emphGF‑Score (GREAT‑Fairness Score), a framework that decomposes the certified GREAT Score into per‑class robustness profiles and quantifies their disparity through four metrics grounded in welfare economics: the Robustness Disparity Index (RDI), the Normalized Robustness Gini Coefficient (NRGC), Worst‑Case Class Robustness (WCR), and a Fairness‑Penalized GREAT Score (FP‑GREAT). The framework further eliminates the original method's dependence on adversarial attacks through a self‑calibration procedure that tunes the temperature parameter using only clean accuracy correlations. Evaluating 22 models from RobustBench across CIFAR‑10 and ImageNet, we find that the decomposition is exact, that per‑class scores reveal consistent vulnerability patterns (e.g., ``cat'' is the weakest class in 76% of CIFAR‑10 models), and that more robust models tend to exhibit greater class‑level disparity. These results establish a practical, attack‑free auditing pipeline for diagnosing where certified robustness guarantees fail to protect all classes equally. We release our code on \hrefhttps://github.com/aryashah2k/gf‑scoreGitHub.

Authors:Michele De Vita, Julian Wiederer, Vasileios Belagiannis
Title: Forecasting the Past: Gradient-Based Distribution Shift Detection in Trajectory Prediction
Abstract:
Trajectory prediction models often fail in real‑world automated driving due to distributional shifts between training and test conditions. Such distributional shifts, whether behavioural or environmental, pose a critical risk by causing the model to make incorrect forecasts in unfamiliar situations. We propose a self‑supervised method that trains a decoder in a post‑hoc fashion on the self‑supervised task of forecasting the second half of observed trajectories from the first half. The L2 norm of the gradient of this forecasting loss with respect to the decoder's final layer defines a score to identify distribution shifts. Our approach, first, does not affect the trajectory prediction model, ensuring no interference with original prediction performance and second, demonstrates substantial improvements on distribution shift detection for trajectory prediction on the Shifts and Argoverse datasets. Moreover, we show that this method can also be used to early detect collisions of a deep Q‑Network motion planner in the Highway simulator. Source code is available at https://github.com/Michedev/forecasting‑the‑past.

Authors:Yuangang Li, Justin Tian Jin Chen, Ethan Yu, David Hong, Iftekhar Ahmed
Title: Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks
Abstract:
Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed for coding, and current benchmarks focus primarily on code generation, leaving other coding tasks largely unexplored. We introduce CodeRQ‑Bench, the first benchmark for evaluating LLM reasoning quality across three coding task categories: generation, summarization, and classification. Using this benchmark, we analyze 1,069 mismatch cases from existing evaluators, identify five recurring limitations, and derive four design insights for reasoning evaluation in coding tasks. Guided by these insights, we propose VERA, a two‑stage evaluator that combines evidence‑grounded verification with ambiguity‑aware score correction. Experiments on CodeRQ‑Bench show that VERA consistently outperforms strong baselines across four datasets, improving AUCROC by up to 0.26 and AUPRC by up to 0.21. We release CodeRQ‑Bench at https://github.com/MrLYG/CodeRQ‑Bench, supporting future investigations.

Authors:Yexiong Lin, Jia Shi, Shanshan Ye, Wanyu Wang, Yu Yao, Tongliang Liu
Title: SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
Abstract:
Flow matching has emerged as a powerful generative framework, with recent few‑step methods achieving remarkable inference acceleration. However, we identify a critical yet overlooked limitation: these models suffer from severe diversity degradation, concentrating samples on dominant modes while neglecting rare but valid variations of the target distribution. We trace this degradation to averaging distortion: when trained with MSE objectives, class‑conditional flows learn a frequency‑weighted mean over intra‑class sub‑modes, causing the model to over‑represent high‑density modes while systematically neglecting low‑density ones. To address this, we propose SubFlow, Sub‑mode Conditioned Flow Matching, which eliminates averaging distortion by decomposing each class into fine‑grained sub‑modes via semantic clustering and conditioning the flow on sub‑mode indices. Each conditioned sub‑distribution is approximately unimodal, so the learned flow accurately targets individual modes with no averaging distortion, restoring full mode coverage in a single inference step. Crucially, SubFlow is entirely plug‑and‑play: it integrates seamlessly into existing one‑step models such as MeanFlow and Shortcut Models without any architectural modifications. Extensive experiments on ImageNet‑256 demonstrate that SubFlow yields substantial gains in generation diversity (Recall) while maintaining competitive image quality (FID), confirming its broad applicability across different one‑step generation frameworks. Project page: https://yexionglin.github.io/subflow.

Authors:Sandra Gómez-Gálvez, Tobias Olenyi, Gillian Dobbie, Katerina Taškova
Title: Socrates Loss: Unifying Confidence Calibration and Classification by Leveraging the Unknown
Abstract:
Deep neural networks, despite their high accuracy, often exhibit poor confidence calibration, limiting their reliability in high‑stakes applications. Current ad‑hoc confidence calibration methods attempt to fix this during training but face a fundamental trade‑off: two‑phase training methods achieve strong classification performance at the cost of training instability and poorer confidence calibration, while single‑loss methods are stable but underperform in classification. This paper addresses and mitigates this stability‑performance trade‑off. We propose Socrates Loss, a novel, unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class, whose predictions directly influence the loss function and a dynamic uncertainty penalty. This unified objective allows the model to be optimized for both classification and confidence calibration simultaneously, without the instability of complex, scheduled losses. We provide theoretical guarantees that our method regularizes the model to prevent miscalibration and overfitting. Across four benchmark datasets and multiple architectures, our comprehensive experiments demonstrate that Socrates Loss consistently improves training stability while achieving more favorable accuracy‑calibration trade‑off, often converging faster than existing methods.

Authors:Ziqing Wang, Yibo Wen, Abhishek Pandy, Han Liu, Kaize Ding
Title: MolMem: Memory-Augmented Agentic Reinforcement Learning for Sample-Efficient Molecular Optimization
Abstract:
In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial‑and‑error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long‑term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi‑turn agentic reinforcement learning (RL) framework with a dual‑memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold‑start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory‑augmented formulation, we train the policy with dense step‑wise rewards, turning costly rollouts into long‑term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single‑property tasks (1.5× over the best baseline) and 52% on multi‑property tasks using only 500 oracle calls. Our code is available at https://github.com/REAL‑Lab‑NU/MolMem.

Authors:Disha Patel
Title: LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics
Abstract:
System log anomaly detection is critical for maintaining the reliability of large‑scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in Large Language Models (LLMs) offer promising new approaches to log understanding, but a systematic comparison of LLM‑based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM‑based and traditional approaches for log anomaly detection across four widely‑used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine‑tuned transformer models (BERT, RoBERTa), and (3) prompt‑based LLM approaches (GPT‑3.5, GPT‑4, LLaMA‑3) in zero‑shot and few‑shot settings. Our experiments reveal that while fine‑tuned transformers achieve the highest F1‑scores (0.96‑0.99), prompt‑based LLMs demonstrate remarkablezero‑shot capabilities (F1: 0.82‑0.91) without requiring any labeled training data ‑‑ a significant advantage for real‑world deployment where labeled anomalies are scarce. We further analyze the cost‑accuracy trade‑offs, latency characteristics, and failure modes of each approach. Our findings provide actionable guidelines for practitioners choosing log anomaly detection methods based on their specific constraints regarding accuracy, latency, cost, and label availability. All code and experimental configurations are publicly available to facilitate reproducibility.

Authors:Md Tanvirul Alam
Title: Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models
Abstract:
Large vision‑language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule‑mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM‑Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic‑fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse‑rule gap, while semantically loaded aliases reopen it. Post‑training is strongly rule‑aligned: training on one rule improves same‑rule transfer but hurts opposite‑rule transfer, while joint‑rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late‑layer activation steering partially recovers degraded performance, indicating that semantic‑fixation errors are at least partly editable in late representations. Project page, code, and dataset available at https://maveryn.github.io/vlm‑fix/.

Authors:Arun Sharma
Title: Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks
Abstract:
We introduce compute‑grounded reasoning (CGR), a design paradigm for spatial‑aware research agents in which every answerable sub‑problem is resolved by deterministic computation before a language model is asked to generate. Spatial Atlas instantiates CGR as a single Agent‑to‑Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question‑answering benchmark spanning factory, warehouse, and retail environments, and MLE‑Bench, a suite of 75 Kaggle machine learning competitions requiring end‑to‑end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy‑guided action selection maximizes information gain per step and routes queries across a three‑tier frontier model stack (OpenAI + Anthropic). A self‑healing ML pipeline with strategy‑aware code generation, a score‑driven iterative refinement loop, and a prompt‑based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.

Authors:Zixuan Liu, Xiaolin Sun, Zizhan Zheng
Title: Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
Abstract:
Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r‑correlation between proxy and true rewards, but existing methods like occupancy‑regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r‑correlated proxy rewards. We derive a tractable max‑min formulation, where the agent maximizes performance under the worst‑case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst‑case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst‑case returns, and offer improved robustness and stability across different levels of proxy‑true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain. The code is available at https://github.com/ZixuanLiu4869/reward_hacking.

Authors:Vladimir Vasilenko
Title: Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space
Abstract:
Large language models map semantically related prompts to similar internal representations ‑‑ a phenomenon interpretable as attractor‑like dynamics. We ask whether the identity document of a persistent cognitive agent (its cognitive_core) exhibits analogous attractor‑like behavior. We present a controlled experiment on Llama 3.1 8B Instruct, comparing hidden states of an original cognitive_core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Mean‑pooled states at layers 8, 16, and 24 show that paraphrases converge to a tighter cluster than controls (Cohen's d > 1.88, p < 10^‑27, Bonferroni‑corrected). Replication on Gemma 2 9B confirms cross‑architecture generalizability. Ablations suggest the effect is primarily semantic rather than structural, and that structural completeness appears necessary to reach the attractor region. An exploratory experiment shows that reading a scientific description of the agent shifts internal state toward the attractor ‑‑ closer than a sham preprint ‑‑ distinguishing knowing about an identity from operating as that identity. These results provide representational evidence that agent identity documents induce attractor‑like geometry in LLM activation space.

Authors:Jiayi Xin, Xiang Li, Evan Qiang, Weiqing He, Tianqi Shang, Weijie J. Su, Qi Long
Title: UCS: Estimating Unseen Coverage for Improved In-Context Learning
Abstract:
In‑context learning (ICL) performance depends critically on which demonstrations are placed in the prompt, yet most existing selectors prioritize heuristic notions of relevance or diversity and provide limited insight into the coverage of a demonstration set. We propose Unseen Coverage Selection (UKS), a training‑free, subset‑level coverage prior motivated by the principle that a good demonstration set should expose the model to latent cluster unrevealed by the currently selected subset. UCS operationalizes this idea by (1) inducing discrete latent clusters from model‑consistent embeddings and (2) estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good‑‑Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage‑based and training‑free, and can be seamlessly combined with both query‑dependent and query‑independent selection baselines via a simple regularized objective. Experiments on multiple intent‑classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2‑6% under the same selection budget, while also yielding insights into task‑ and model‑level latent cluster distributions. Code is available at https://github.com/Raina‑Xin/UCS.

Authors:Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
Title: The Linear Centroids Hypothesis: How Deep Network Features Represent Data
Abstract:
Identifying and understanding the features that a deep network (DN) extracts from its inputs to produce its outputs is a focal point of interpretability research. The Linear Representation Hypothesis (LRH) identifies features in terms of the linear directions formed by the inputs in a DN's latent space. However, the LRH is limited as it abstracts away from individual components (e.g., neurons and layers), is susceptible to identifying spurious features, and cannot be applied across sub‑components (e.g., multiple layers). In this paper, we introduce the Linear Centroids Hypothesis (LCH) as a new framework for identifying the features of a DN. The LCH posits that features correspond to linear directions of centroids, which are vector summarizations of the functional behavior of a DN in a local region of its input space. Interpretability studies under the LCH can leverage existing LRH tools, such as sparse autoencoders, by applying them to the DN's centroids rather than to its latent activations. We demonstrate that doing so yields sparser feature dictionaries for DINO vision transformers, which also perform better on downstream tasks. The LCH also inspires novel approaches to interpretability; for example, LCH can readily identify circuits in GPT2‑Large. For code to study the LCH https://github.com/ThomasWalker1/LinearCentroidsHypothesis .

Authors:Elliott C. Pryor, Marc D. Breton, Anas El Fathi
Title: A unified data format for managing diabetes time-series data: DIAbetes eXchange (DIAX)
Abstract:
Diabetes devices, including Continuous Glucose Monitoring (CGM), Smart Insulin Pens, and Automated Insulin Delivery systems, generate rich time‑series data widely used in research and machine learning. However, inconsistent data formats across sources hinder sharing, integration, and analysis. We present DIAX (DIAbetes eXchange), a standardized JSON‑based format for unifying diabetes time‑series data, including CGM, insulin, and meal signals. DIAX promotes interoperability, reproducibility, and extensibility, particularly for machine learning applications. An open‑source repository provides tools for dataset conversion, cross‑format compatibility, visualization, and community contributions. DIAX is a translational resource, not a data host, ensuring flexibility without imposing data‑sharing constraints. Currently, DIAX is compatible with other standardization efforts and supports major datasets (DCLP3, DCLP5, IOBP2, PEDAP, T1Dexi, Loop), totaling over 10 million patient‑hours of data. https://github.com/Center‑for‑Diabetes‑Technology/DIAX

Authors:Mohammed Ezzaldin Babiker Abdullah
Title: Thermodynamic Liquid Manifold Networks: Physics-Bounded Deep Learning for Solar Forecasting in Autonomous Off-Grid Microgrids
Abstract:
The stable operation of autonomous off‑grid photovoltaic systems requires solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data‑driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The methodology projects 22 meteorological and geometric variables into a Koopman‑linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha‑Gate. This system synthesizes real‑time atmospheric opacity with theoretical clear‑sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero‑lag synchronization during rapid weather shifts. Validated against a rigorous five‑year testing horizon in a severe semi‑arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero‑magnitude nocturnal error across all 1826 testing days and exhibits a sub‑30‑minute phase response during high‑frequency optical transients. Comprising exactly 63,458 trainable parameters, this ultra‑lightweight design establishes a robust, thermodynamically consistent standard for edge‑deployable microgrid controllers.

Authors:Wenhao Zhang, Lin Mu, Li Ni, Peiquan Jin, Yiwen Zhang
Title: Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions
Abstract:
Low‑rank adaptation (LoRA) is a widely used strategy for efficient fine‑tuning of large language models (LLMs), but its strictly linear structure fundamentally limits expressive capacity. The bilinear formulation of weight updates captures only first‑order dependencies between low‑rank factors, restricting the modeling of nonlinear and higher‑order parameter interactions. In this paper, we propose Polynomial Expansion Rank Adaptation (PERA), a novel method that introduces structured polynomial expansion directly into the low‑rank factor space. By expanding each low‑rank factor to synthesize high‑order interaction terms before composition, PERA transforms the adaptation space into a polynomial manifold capable of modeling richer nonlinear coupling without increasing rank or inference cost. We provide theoretical analysis demonstrating that PERA offers enhanced expressive capacity and more effective feature utilization compare to existing linear adaptation approaches. Empirically, PERA consistently outperforms state‑of‑the‑art methods across diverse benchmarks. Notably, our experiments show that incorporating high‑order nonlinear components particularly square terms is crucial for enhancing expressive capacity and maintaining strong and robust performance under various rank settings. Our code is available at https://github.com/zhangwenhao6/PERA

Authors:Mohammed Ezzaldin Babiker Abdullah
Title: Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems
Abstract:
The stable operation of off‑grid photovoltaic systems requires accurate, computationally efficient solar forecasting. Contemporary deep learning models often suffer from massive computational overhead and physical blindness, generating impossible predictions. This paper introduces the Physics‑Informed State Space Model (PISSM) to bridge the gap between efficiency and physical accuracy for edge‑deployed microcontrollers. PISSM utilizes a dynamic Hankel matrix embedding to filter stochastic sensor noise by transforming raw meteorological sequences into a robust state space. A Linear State Space Model replaces heavy attention mechanisms, efficiently modeling temporal dependencies for parallel processing. Crucially, a novel Physics‑Informed Gating mechanism leverages the Solar Zenith Angle and Clearness Index to structurally bound outputs, ensuring predictions strictly obey diurnal cycles and preventing nocturnal errors. Validated on a multi‑year dataset for Omdurman, Sudan, PISSM achieves superior accuracy with fewer than 40,000 parameters, establishing an ultra‑lightweight benchmark for real‑time off‑grid control.

Authors:Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu
Title: LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
Abstract:
Continuous diffusion has been the foundation of high‑fidelity, controllable, and few‑step generation of many data modalities such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts due to the sparse data space and the underexplored design space. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion, by connecting embedding‑space DLMs to Flow Matching via Bregman divergence, alongside three key innovations: (1) we derive a novel ODE‑based NLL bound for principled evaluation of continuous flow‑based language models; (2) we propose an information‑uniform principle for setting the noise schedule, which motivates a learnable noise scheduler based on a Gumbel distribution; and (3) we revise prior training protocols by incorporating self‑conditioning, as we find it improves both likelihood and sample quality of embedding‑space DLMs with effects substantially different from discrete diffusion. Putting everything together, LangFlow rivals top discrete DLMs on both the perplexity (PPL) and the generative perplexity (Gen. PPL), reaching a PPL of 30.0 on LM1B and 24.6 on OpenWebText. It even exceeds autoregressive baselines in zero‑shot transfer on 4 out of 7 benchmarks. LangFlow provides the first clear evidence that continuous diffusion is a promising paradigm for language modeling. Homepage: https://github.com/nealchen2003/LangFlow

Authors:Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen
Title: RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
Abstract:
Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi‑dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine‑grained rewards for reinforcement learning; at test time, a Generate‑Critique‑Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference‑Anchored Rationalization (PARROT), a principled framework that recovers high‑quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state‑of‑the‑art preference prediction among open‑source reward models, competitive with Gemini‑2.5‑Pro, while using 10‑20x less training data than comparable baselines. As an RL reward, it consistently improves text‑to‑image and image‑editing generators beyond scalar alternatives. Most strikingly, its test‑time critique‑and‑refine loop matches or exceeds RL‑based fine‑tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.

Authors:Denizalp Goktas, Gerardo Riaño-Briceño, Alif Abdullah, Aryan Nair, Chenkai Shen, Beatriz de Lucio, Alexandra Magnusson, Farhan Mashrur, Ahmed Abdulla, Shawrna Sen, Mahitha Thippireddy, Gregory Schwartz, Amy Greenwald
Title: TempusBench: An Evaluation Framework for Time-Series Forecasting
Abstract:
Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time‑series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open‑source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community‑accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, existing evaluation frameworks comprise benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre‑train TSFMs. Second, these frameworks evaluate models along a narrowly defined set of benchmark forecasting tasks, such as forecast horizon length or domain, but overlook core statistical properties such as non‑stationarity and seasonality. Third, domain‑specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks do not enforce a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open‑source evaluation framework for TSFMs. TempusBench consists of 1) new datasets which are not included in existing TSFM pretraining corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a tensorboard‑based visualization interface. We provide access to our code on GitHub: https://github.com/Smlcrm/TempusBench and maintain a live leaderboard at https://benchmark.smlcrm.com/.

Authors:Prateek Chanda, Prayas Agrawal, Karthik S. Gurumoorthy, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria
Title: UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees
Abstract:
Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low‑quality prototypes for minority classes. We present \methodprop, a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality‑constrained maximization of a \emphsuper‑additive objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport‑based submodular objective. We prove that this reformulation enables a greedy algorithm with a (1‑1/e) approximation guarantee relative to the original super‑additive maximization problem. Empirically, we showcase that enforcing uniform prototype weights in UniPROT consistently improves minority‑class representation in imbalanced classification benchmarks without compromising majority‑class accuracy. In both finetuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform‑weighted prototype selection. Our code is publicly available at GitHub\footnoteCode: https://github.com/efficiency‑learning/UniPROT

Authors:Dheeraj Mudireddy, Sai Patibandla
Title: PokeRL: Reinforcement Learning for Pokemon Red
Abstract:
Pokemon Red is a long‑horizon JRPG with sparse rewards, partial observability, and quirky control mechanics that make it a challenging benchmark for reinforcement learning. While recent work has shown that PPO agents can clear the first two gyms using heavy reward shaping and engineered observations, training remains brittle in practice, with agents often degenerating into action loops, menu spam, or unproductive wandering. In this paper, we present PokeRL, a modular system that trains deep reinforcement learning agents to complete early game tasks in Pokemon Red, including exiting the player's house, exploring Pallet Town to reach tall grass, and winning the first rival battle. Our main contributions are a loop‑aware environment wrapper around the PyBoy emulator with map masking, a multi‑layer anti‑loop and anti‑spam mechanism, and a dense hierarchical reward design. We argue that practical systems like PokeRL, which explicitly model failure modes such as loops and spam, are a necessary intermediate step between toy benchmarks and full Pokemon League champion agents. Code is available at https://github.com/reddheeraj/PokemonRL

Authors:Chirag Shinde
Title: Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V
Abstract:
We propose two complementary modifications to transformer attention blocks. First, a non‑linear pre‑projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position‑agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre‑projection's features around the attention mechanism, allowing content information to bypass position‑aware attention where beneficial. In frozen‑probe experiments on Pythia‑160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and ‑39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.

Authors:Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi
Title: Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
Abstract:
Reinforcement learning (RL) has been widely used to train LLM agents for multi‑turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On‑policy self‑distillation (OPSD) alleviates this by providing dense token‑level supervision from a privileged teacher that has access to ground‑truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill‑SD, a framework that turns the agent's own trajectories into dynamic training‑only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance‑weighted reverse‑KL loss to provide gradient‑correct token‑level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill‑SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill‑sd/

Authors:Luis Balderas, Miguel Lastra, José M. Benítez
Title: MoEITS: A Green AI approach for simplifying MoE-LLMs
Abstract:
Large language models are transforming all areas of academia and industry, attracting the attention of researchers, professionals, and the general public. In the trek for more powerful architectures, Mixture‑of‑Experts, inspired by ensemble models, have emerged as one of the most effective ways to follow. However, this implies a high computational burden for both training and inference. To reduce the impact on computing and memory footprint as well as the energy consumption, simplification methods has arisen as very effective procedures. In this paper, an original algorithm, MoEITS, for MoE‑LLMs simplification is presented. The algorithm is characterized by a refined simplicity, underpinned by standardized Information Theoretic frameworks. MoEITS is analyzed in depth from theoretical and practical points of view. Its computational complexity is studied. Its performance on the accuracy of the simplified LLMs and the reduction rate achieved is assessed through a thoroughly designed experimentation. This empirical evaluation includes a comparison with state‑of‑the‑art MoE‑LLM pruning methods applied on Mixtral 8×7B, Qwen1.5‑2.7B, and DeepSeek‑V2‑Lite. The extensive experimentation conducted demonstrates that MoEITS outperforms state‑of‑the‑art techniques by generating models that are both effective across all benchmarks and computationally efficient. The code implementing the method will be available at https://github.com/luisbalru/MoEITS.

Authors:Yuzhen Mao, Qitong Wang, Martin Ester, Ke Li
Title: IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
Abstract:
Key‑Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource‑constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long‑generation tasks such as chain‑of‑thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU‑GPU transfers. Experimental results on LongBench show that, with a 256‑token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading‑based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long‑sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.

Authors:Jiahui Zhang, Rouyi Wang, Kuangqi Zhou, Tianshu Xiao, Lingyan Zhu, Yaosen Min, Yang Wang
Title: PepBenchmark: A Standardized Benchmark for Peptide Machine Learning
Abstract:
Peptide therapeutics are widely regarded as the "third generation" of drugs, yet progress in peptide Machine Learning (ML) are hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which unifies datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well‑curated collection comprising 29 canonical‑peptide and 6 non‑canonical‑peptide datasets across 7 groups, systematically covering key aspects of peptide drug development, representing, to the best of our knowledge, the most comprehensive AI‑ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent dataset cleaning, construction, splitting, and feature transformation, mitigating quality issues common in ad hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: Fingerprint‑based, GNN‑based, PLM‑based, and SMILES‑based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real‑world applications. The data and code are publicly available at https://github.com/ZGCI‑AI4S‑Pep/PepBenchmark/.

Authors:Xiangyang Yin, Xingyu Liu, Tianhua Xia, Bo Bao, Vithursan Thangarasa, Valavan Manohararajah, Eric Sather, Sai Qian Zhang
Title: CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts
Abstract:
Outliers have emerged as a fundamental bottleneck in preserving accuracy for low‑precision large models, particularly within Mixture‑of‑Experts (MoE) architectures that are increasingly central to large‑scale language modeling. Under post‑training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation‑based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low‑precision deployment. In this work, we tackle this challenge by introducing CodeQuant, a unified quantization‑and‑clustering scheme that contains smoothing activation outliers via learnable rotation and absorbing weight outliers into fine‑tuned cluster centroids for MoE. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. Coupled with a dedicated kernel design for GPU and CPU, CodeQuant achieves up to 4.15× speedup while delivering significantly higher accuracy than state‑of‑the‑art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE‑based large language models under low‑precision constraints. Our code is available at https://github.com/SAI‑Lab‑NYU/CodeQuant.

Authors:Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yinzhe Zhou
Title: Towards Green Wearable Computing: A Physics-Aware Spiking Neural Network for Energy-Efficient IMU-based Human Activity Recognition
Abstract:
Wearable IMU‑based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power‑hungry floating‑point operations and rigid requirement to process complete temporal windows severely cripple battery‑constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event‑driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics‑Aware Spiking Neural Network (PAS‑Net), a fully multiplier‑free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human‑joint physical constraints. Temporally, an O(1)‑memory causal neuromodulator yields context‑aware dynamic threshold neurons, adapting actively to non‑stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early‑exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS‑Net achieves state‑of‑the‑art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence‑driven early‑exit capability drastically reduces dynamic energy consumption by up to 98%. PAS‑Net establishes a robust, ultra‑low‑power neuromorphic standard for always‑on wearable sensing. The source code and pre‑trained models are publicly available at https://github.com/zhengnaichuan2022/PAS‑Net.git.

Authors:Mariano Fernández Méndez
Title: Descriptor-Injected Cross-Modal Learning: A Systematic Exploration of Audio-MIDI Alignment via Spectral and Melodic Features
Abstract:
Cross‑modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality‑specific encoders with hand‑crafted domain features, as a bridge across this gap. In a three‑phase campaign covering 13 descriptor‑mechanism combinations, 6 architectural families, and 3 training schedules, the best configuration reaches a mean S of 84.0 percent across five independent seeds, improving the descriptor‑free baseline by 8.8 percentage points. Causal ablation shows that the audio descriptor A4, based on octave‑band energy dynamics, drives the gain in the top dual models, while the MIDI descriptor D4 has only a weak inference‑time effect despite improving training dynamics. We also introduce reverse cross‑attention, where descriptor tokens query encoder features, reducing attention operations relative to the standard formulation while remaining competitive. CKA analysis shows that descriptors substantially increase audio‑MIDI transformer layer alignment, indicating representational convergence rather than simple feature concatenation. Perturbation analysis identifies high‑frequency octave bands as the dominant discriminative signal. All experiments use MAESTRO v3.0.0 with an evaluation protocol controlling for composer and piece similarity.

Authors:Mani Rash Ahmadi
Title: The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
Abstract:
We prove that in a coupled Kuramoto oscillator network at stable equilibrium, the physical phase displacement under weak output nudging is the gradient of the loss with respect to natural frequencies, with equality as the nudging strength beta tends to zero. Prior oscillator equilibrium propagation work explicitly set aside natural frequency as a learnable parameter; we show that on sparse layered architectures, frequency learning outperforms coupling‑weight learning among converged seeds (96.0% vs. 83.3% at matched parameter counts, p = 1.8e‑12). The approximately 50% convergence failure rate under random initialization is a loss‑landscape property, not a gradient error; topology‑aware spectral seeding eliminates it in all settings tested (46/100 to 100/100 seeds on the primary task; 50/50 on a second task, K‑only training, and a larger architecture).

Authors:Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang
Title: The Amazing Agent Race: Strong Tool Users, Weak Navigators
Abstract:
Existing tool‑use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork‑merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi‑step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live‑API validation. Three complementary metrics (finish‑line accuracy, pit‑stop visit rate, and roadblock completion rate) separately diagnose navigation, tool‑use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool‑use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the‑amazing‑agent‑race

Authors:Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang
Title: A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction
Abstract:
Learning multi‑scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content‑Aware Pyramid Attention Network (A3‑FPN), to augment multi‑scale feature representation via the asymptotically disentangled framework and content‑aware attention modules. Specifically, A3‑FPN employs a horizontally‑spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position‑wise offsets and weights for context‑aware resampling, and learns deep context reweights to improve intra‑category similarity. In feature reassembly, it further strengthens intra‑scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019‑DET and Cityscapes demonstrate that A3‑FPN can be easily integrated into state‑of‑the‑art CNN and Transformer‑based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin‑L backbone, A3‑FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason‑ching/A3‑FPN.

Authors:Luca Jiang-Tao Yu, Chenshu Wu
Title: RF-LEGO: Modularized Signal Processing-Deep Learning Co-Design for RF Sensing via Deep Unrolling
Abstract:
Wireless sensing, traditionally relying on signal processing (SP) techniques, has recently shifted toward data‑driven deep learning (DL) to achieve performance breakthroughs. However, existing deep wireless sensing models are typically end‑to‑end and task‑specific, lacking reusability and interpretability. We propose RF‑LEGO, a modular co‑design framework that transforms interpretable SP algorithms into trainable, physics‑grounded DL modules through deep unrolling. By replacing hand‑tuned parameters with learnable ones while preserving core processing structures and mathematical operators, RF‑LEGO ensures modularity, cascadability, and structure‑aligned interpretability. Specifically, we introduce three deep‑unrolled modules for critical RF sensing tasks: frequency transform, spatial angle estimation, and signal detection. Extensive experiments using real‑world data for Wi‑Fi, millimeter‑wave, UWB, and 6G sensing demonstrate that RF‑LEGO significantly outperforms existing SP and DL baselines, both standalone and when integrated into multiple downstream tasks. RF‑LEGO pioneers a novel SP‑DL co‑design paradigm for wireless sensing via deep unrolling, shedding light on efficient and interpretable deep wireless sensing solutions. Our code is available at https://github.com/aiot‑lab/RF‑LEGO.

Authors:Rui Lin, Zhenyu Jin, Guancheng Zhou, Xuyang Ge, Wentao Shu, Jiaxing Wu, Junxuan Wang, Zhengfu He, Junping Zhang, Xipeng Qiu
Title: Tracing the Thought of a Grandmaster-level Chess-Playing Transformer
Abstract:
While modern transformer neural networks achieve grandmaster‑level performance in chess and other reasoning tasks, their internal computation process remains largely opaque. Focusing on Leela Chess Zero (LC0), we introduce a sparse decomposition framework to interpret its internal computation by decomposing its MLP and attention modules with sparse replacement layers, which capture the primary computation process of LC0. We conduct a detailed case study showing that these pathways expose rich, interpretable tactical considerations that are empirically verifiable. We further introduce three quantitative metrics and show that LC0 exhibits parallel reasoning behavior consistent with the inductive bias of its policy head architecture. To the best of our knowledge, this is the first work to decompose the internal computation of a transformer on both MLP and attention modules for interpretability. Combining sparse replacement layers and causal interventions in LC0 provides a comprehensive understanding of advanced tactical reasoning, offering critical insights into the underlying mechanisms of superhuman systems. Our code is available at https://github.com/JacklE0niden/Leela‑SAEs.

Authors:Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Keyu Fan, Weihao Ye, Jing Xiong, Hui Shen, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Yulei Qian, Yuchen Xie, Ngai Wong
Title: Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
Abstract:
As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affecting the training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS‑related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome‑Attention‑Sink.

Authors:Yujie Li, Jiuniu Wang, Mugen Peng, Guangzuo Li, Wenjia Xu
Title: Graph-RHO: Critical-path-aware Heterogeneous Graph Network for Long-Horizon Flexible Job-Shop Scheduling
Abstract:
Long‑horizon Flexible Job‑Shop Scheduling~(FJSP) presents a formidable combinatorial challenge due to complex, interdependent decisions spanning extended time horizons. While learning‑based Rolling Horizon Optimization~(RHO) has emerged as a promising paradigm to accelerate solving by identifying and fixing invariant operations, its effectiveness is hindered by the structural complexity of FJSP. Existing methods often fail to capture intricate graph‑structured dependencies and ignore the asymmetric costs of prediction errors, in which misclassifying critical‑path operations is significantly more detrimental than misclassifying non‑critical ones. Furthermore, dynamic shifts in predictive confidence during the rolling process make static pruning thresholds inadequate. To address these limitations, we propose Graph‑RHO, a novel critical‑path‑aware graph‑based RHO framework. First, we introduce a topology‑aware heterogeneous graph network that encodes subproblems as operation‑machine graphs with multi‑relational edges, leveraging edge‑feature‑aware message passing to predict operation stability. Second, we incorporate a critical‑path‑aware mechanism that injects inductive biases during training to distinguish highly sensitive bottleneck operations from robust ones. Third, we devise an adaptive thresholding strategy that dynamically calibrates decision boundaries based on online uncertainty estimation to align model predictions with the solver's search space. Extensive experiments on standard benchmarks demonstrate that \mboxGraph‑RHO establishes a new state of the art in solution quality and computational efficiency. Remarkably, it exhibits exceptional zero‑shot generalization, reducing solve time by over 30% on large‑scale instances (2000 operations) while achieving superior solution quality. Our code is available \hrefhttps://github.com/IntelliSensing/Graph‑RHOhere.

Authors:Kening Wang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng, Jiale Wei, Kailun Yang, Rainer Stiefelhagen, Kunyu Peng
Title: Towards Multi-Source Domain Generalization for Sleep Staging with Noisy Labels
Abstract:
Automatic sleep staging is a multimodal learning problem involving heterogeneous physiological signals such as EEG and EOG, which often suffer from domain shifts across institutions, devices, and populations. In practice, these data are also affected by noisy annotations, yet label‑noise‑robust multi‑source domain generalization remains underexplored. We present the first benchmark for Noisy Labels in Multi‑Source Domain‑Generalized Sleep Staging (NL‑DGSS) and show that existing noisy‑label learning methods degrade substantially when domain shifts and label noise coexist. To address this challenge, we propose FF‑TRUST, a domain‑invariant multimodal sleep staging framework with Joint Time‑Frequency Early Learning Regularization (JTF‑ELR). By jointly exploiting temporal and spectral consistency together with confidence‑diversity regularization, FF‑TRUST improves robustness under noisy supervision. Experiments on five public datasets demonstrate consistent state‑of‑the‑art performance under diverse symmetric and asymmetric noise settings. The benchmark and code will be made publicly available at https://github.com/KNWang970918/FF‑TRUST.git.

Authors:Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee
Title: Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions
Abstract:
Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT‑v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS‑MARCO, both models show a drop of 86‑97% on long, narrative queries (TREC ToT 2025). Ablations prove this failure is architectural: performance plateaus at 20 words because the MaxSim operator's uniform token weighting cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8‑point gap due to ConstBERT's sparse centroid coverage, and fine‑tuning with 3x more data actually degrades performance by up to 29%. We conclude that architectural constraints in multi‑vector retrieval cannot be overcome by adaptation alone. Code: https://github.com/utshabkg/multi‑vector‑reproducibility.

Authors:Weijian Mai, Mu Nan, Yu Zhu, Jiahang Cao, Rui Zhang, Yuqin Dai, Chunfeng Song, Andrew F. Luo, Jiamin Wu
Title: NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
Abstract:
Visual encoding and decoding models act as gateways to understanding the neural mechanisms underlying human visual perception. Typically, visual encoding models that predict brain activity from stimuli and decoding models that reproduce stimuli from brain activity are treated as distinct tasks, requiring separate models and training procedures. This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. NeuroFlow introduces two key components: (1) NeuroVAE is designed as a variational backbone to model neural variability and establish a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (2) Cross‑modal Flow Matching (XFM) bypasses the typical paradigm of noise‑to‑data diffusion guided by a specific modality condition, instead learning a reversibly consistent flow model between visual and neural latent distributions. For the first time, visual encoding and decoding are reformulated as a time‑dependent, reversible process within a shared latent space for unified modeling. Empirical results demonstrate that NeuroFlow achieves superior overall performance in visual encoding and decoding tasks with higher computational efficiency compared to any isolated methods. We further analyze principal factors that steer the model toward encoding‑decoding consistency and, through brain functional analyses, demonstrate that NeuroFlow captures consistent activation patterns underlying neural variability. NeuroFlow marks a major step toward unified visual encoding and decoding from neural activity, providing mechanistic insights that inform future bidirectional visual brain‑computer interfaces.

Authors:Justin Li, Daniel Ding, Asmita Yuki Pritha, Aryana Hou, Xin Wang, Shu Hu
Title: Robust Fair Disease Diagnosis in CT Images
Abstract:
Automated diagnosis from chest CT has improved considerably with deep learning, but models trained on skewed datasets tend to perform unevenly across patient demographics. However, the situation is worse than simple demographic bias. In clinical data, class imbalance and group underrepresentation often coincide, creating compound failure modes that neither standard rebalancing nor fairness corrections can fix alone. We introduce a two‑level objective that targets both axes of this problem. Logit‑adjusted cross‑entropy loss operates at the sample level, shifting decision margins by class frequency with provable consistency guarantees. Conditional Value at Risk aggregation operates at the group level, directing optimization pressure toward whichever demographic group currently has the higher loss. We evaluate on the Fair Disease Diagnosis benchmark using a 3D ResNet‑18 pretrained on Kinetics‑400, classifying CT volumes into Adenocarcinoma, Squamous Cell Carcinoma, COVID‑19, and Normal groups with patient sex annotations. The training set illustrates the compound problem concretely: squamous cell carcinoma has 84 samples total, 5 of them female. The combined loss reaches a gender‑averaged macro F1 of 0.8403 with a fairness gap of 0.0239, a 13.3% improvement in score and 78% reduction in demographic disparity over the baseline. Ablations show that each component alone falls short. The code is publicly available at https://github.com/Purdue‑M2/Fair‑Disease‑Diagnosis.

Authors:Dongmin Kim, Hoshinori Kanazawa, Yasuo Kuniyoshi
Title: Active Inference with a Self-Prior in the Mirror-Mark Task
Abstract:
The mirror self‑recognition test evaluates whether a subject touches a mark on its own body that is visible only in a mirror, and is widely used as an indicator of self‑awareness. In this study, we present a computational model in which this behavior emerges spontaneously through a single mechanism, the self‑prior, without any external reward. The self‑prior, implemented with a Transformer, learns the density of familiar multisensory experiences; when a novel mark appears, the discrepancy from this learned distribution drives mark‑directed behavior through active inference. A simulated infant, relying solely on vision and proprioception without tactile input, discovered a sticker placed on its own face in the mirror and removed it in approximately 70% of cases without any explicit instruction. Expected free energy decreased significantly after sticker removal, confirming that the self‑prior operates as an internal criterion for distinguishing self from non‑self. Cross‑modal sampling further demonstrated that the self‑prior captures visual‑‑proprioceptive associations, functioning as a probabilistic body schema. These results provide a concise computational account of the key behavior observed in the mirror test and suggest that the free energy principle can serve as a unifying hypothesis for investigating the developmental origins of self‑awareness. Code is available at: https://github.com/kim135797531/self‑prior‑mirror

Authors:Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L Steenwyk, Blake Lash, Andrew D White, Samuel G Rodriques
Title: LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
Abstract:
Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI‑driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real‑world capabilities. Beyond rote knowledge and even just reasoning to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark LAB‑Bench as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real‑world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB‑Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB‑Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model‑specific accuracy differences range from ‑26% to ‑46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB‑Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.

Authors:Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella, Mahdi M. Kalayeh, Björn Ommer
Title: Envisioning the Future, One Step at a Time
Abstract:
Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent‑space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large‑scale exploration of future hypotheses costly and limits performance when long‑horizon, multi‑modal motion is essential. We address this by formulating the prediction of open‑set future scene dynamics as step‑wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics‑centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long‑range coherence. We further introduce OWM, a benchmark for open‑set motion prediction based on diverse in‑the‑wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real‑world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders‑of‑magnitude higher sampling speed, making open‑set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.

Authors:Kyle Whitecross, Negin Rahimi
Title: RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval
Abstract:
We propose RecaLLM, a set of reasoning language models post‑trained to make effective use of long‑context information. In‑context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open‑source LLMs, we observe that in‑context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test‑time scaling that we refer to as lost‑in‑thought: reasoning steps that improve performance also make subsequent in‑context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in‑context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible‑overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long‑context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long‑context approaches, highlighting a promising path toward improving long‑context performance without expensive long‑context training data.

Authors:Maksim Anisimov, Francesco Belardinelli, Matthew Wicker
Title: SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
Abstract:
Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety‑critical tasks. Often, deployment environments exhibit non‑stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid‑world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation‑based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.

Authors:Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang
Title: Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
Abstract:
Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and do camera‑controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel‑aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self‑Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre‑defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera‑controlled video generation, and introduce a closed‑loop self‑consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is subtantially more effective.

Authors:Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, Yafei Yang, Hongkang Song, Shenxing Wei, Zihui Zhang, Peng Huang, Shijie Liu, Zhengli Hao, Hao Li, Yitian Li, Wenqi Zhou, Zhihan Zhao, Zongqi He, Hongtao Wen, Shouwang Huang, Peng Yun, Bowen Cheng, Pok Kazaf Fu, Wai Kit Lai, Jiahao Chen, Kaiyuan Wang, Zhixuan Sun, Ziqi Li, Haochen Hu, Di Zhang, Chun Ho Yuen, Bing Wang, Zhihua Wang, Chuhang Zou, Bo Yang
Title: PhysInOne: Visual Physics Learning and Reasoning in One Suite
Abstract:
We present PhysInOne, a large‑scale synthetic dataset addressing the critical scarcity of physically‑grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground‑truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics‑aware video generation, long‑/short‑term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine‑tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics‑grounded world models in generation, simulation, and embodied AI.

Authors:Prasad Nimantha Madusanka Ukwatta Hewage, Midhun Chakkravarthy, Ruvan Kumara Abeysekara
Title: Variational Quantum Physics-Informed Neural Networks for Hydrological PDE-Constrained Learning with Inherent Uncertainty Quantification
Abstract:
We propose a Hybrid Quantum‑Classical Physics‑Informed Neural Network (HQC‑PINN) that integrates parameterized variational quantum circuits into the PINN framework for hydrological PDE‑constrained learning. Our architecture encodes multi‑source remote sensing features into quantum states via trainable angle encoding, processes them through a hardware‑efficient variational ansatz with entangling layers, and constrains the output using the Saint‑Venant shallow water equations and Manning's flow equation as differentiable physics loss terms. The inherent stochasticity of quantum measurement provides a natural mechanism for uncertainty quantification without requiring explicit Bayesian inference machinery. We further introduce a quantum transfer learning protocol that pre‑trains on multi‑hazard disaster data before fine‑tuning on flood‑specific events. Numerical simulations on multi‑modal satellite and meteorological data from the Kalu River basin, Sri Lanka, show that the HQC‑PINN achieves convergence in ~3x fewer training epochs and uses ~44% fewer trainable parameters compared to an equivalent classical PINN, while maintaining competitive classification accuracy. Theoretical analysis indicates that hydrological physics constraints narrow the effective optimization landscape, providing a natural mitigation against barren plateaus in variational quantum circuits. This work establishes the first application of quantum‑enhanced physics‑informed learning to hydrological prediction and demonstrates a viable path toward quantum advantage in environmental science.

Authors:Nazir Nayal, Christopher Wewer, Jan Eric Lenssen
Title: MixFlow: Mixed Source Distributions Improve Rectified Flows
Abstract:
Diffusion models and their variations, such as rectified flows, generate diverse and high‑quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curvature, as shown by previous work, is independence between the source distribution (standard Gaussian) and the data distribution. In this work, we tackle this limitation by two complementary contributions. First, we attempt to break away from the standard Gaussian assumption by introducing κ\texttt‑FC, a general formulation that conditions the source distribution on an arbitrary signal κ that aligns it better with the data distribution. Then, we present MixFlow, a simple but effective training strategy that reduces the generative path curvatures and considerably improves sampling efficiency. MixFlow trains a flow model on linear mixtures of a fixed unconditional distribution and a κ\texttt‑FC‑based distribution. This simple mixture improves the alignment between the source and data, provides better generation quality with less required sampling steps, and accelerates the training convergence considerably. On average, our training procedure improves the generation quality by 12% in FID compared to standard rectified flow and 7% compared to previous baselines under a fixed sampling budget. Code available at: \hrefhttps://github.com/NazirNayal8/MixFlowhttps://github.com/NazirNayal8/MixFlow

Authors:Mengxin Fu, Yuezun Li
Title: Detecting Diffusion-generated Images via Dynamic Assembly Forests
Abstract:
Diffusion models are known for generating high‑quality images, causing serious security concerns. To combat this, most efforts rely on deep neural networks (e.g., CNNs and Transformers), while largely overlooking the potential of traditional machine learning models. In this paper, we freshly investigate such alternatives and proposes a novel Dynamic Assembly Forest model (DAF) to detect diffusion‑generated images. Built upon the deep forest paradigm, DAF addresses the inherent limitations in feature learning and scalable training, making it an effective diffusion‑generated image detector. Compared to existing DNN‑based methods, DAF has significantly fewer parameters, much lower computational cost, and can be deployed without GPUs, while achieving competitive performance under standard evaluation protocols. These results highlight the strong potential of the proposed method as a practical substitute for heavyweight DNN models in resource‑constrained scenarios. Our code and models are available at https://github.com/OUC‑VAS/DAF.

Authors:Jiabao Brad Wang, Xiang Shi, Yiliang Yuan, Mustafa Misir
Title: GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimisation
Abstract:
Automated algorithm selection in continuous black‑box optimisation typically relies on fixed landscape descriptors computed under a limited probing budget, yet such descriptors can degrade under problem‑split or cross‑benchmark evaluation. We propose GeoPAS, a geometric probing approach that represents a problem instance by multiple coarse two‑dimensional slices sampled across locations, orientations, and logarithmic scales. A shared validity‑aware convolutional encoder maps each slice to an embedding, conditions it on slice‑scale and amplitude statistics, and aggregates the resulting features permutation‑invariantly for risk‑aware solver selection via log‑scale performance prediction with an explicit penalty on tail failures. On COCO/BBOB with a 12‑solver portfolio in dimensions 2‑‑10, GeoPAS improves over the single best solver under leave‑instance‑out, grouped random, and leave‑problem‑out evaluation. These results suggest that multi‑scale geometric slices provide a useful transferable static signal for algorithm selection, although a small number of heavy‑tail regimes remain and continue to dominate the mean. Our code is available at https://github.com/BradWangW/GeoPAS.

Authors:Salva Rühling Cachay, Duncan Watson-Parris, Rose Yu
Title: U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster
Abstract:
AI‑based weather forecasting now rivals traditional physics‑based ensembles, but state‑of‑the‑art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U‑Cast, a probabilistic forecaster built on a standard U‑Net backbone trained with a simple recipe: deterministic pre‑training on Mean Absolute Error followed by short probabilistic fine‑tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5^\circ\ resolution while reducing training compute by over 10× compared to leading CRPS‑based models and inference latency by over 10× compared to diffusion‑based models. U‑Cast trains in under 12 H200 GPU‑days and generates a 60‑step ensemble forecast in 11 seconds. These results suggest that scalable, general‑purpose architectures paired with efficient training curricula can match complex domain‑specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community. Our code is available at: https://github.com/Rose‑STL‑Lab/u‑cast.

Authors:Yi Luo, Xu Sun, Guangchun Luo, Aiguo Chen
Title: Neighbourhood Transformer: Switchable Attention for Monophily-Aware Graph Learning
Abstract:
Graph neural networks (GNNs) have been widely adopted in engineering applications such as social network analysis, chemical research and computer vision. However, their efficacy is severely compromised by the inherent homophily assumption, which fails to hold for heterophilic graphs where dissimilar nodes are frequently connected. To address this fundamental limitation in graph learning, we first draw inspiration from the recently discovered monophily property of real‑world graphs, and propose Neighbourhood Transformers (NT), a novel paradigm that applies self‑attention within every local neighbourhood instead of aggregating messages to the central node as in conventional message‑passing GNNs. This design makes NT inherently monophily‑aware and theoretically guarantees its expressiveness is no weaker than traditional message‑passing frameworks. For practical engineering deployment, we further develop a neighbourhood partitioning strategy equipped with switchable attentions, which reduces the space consumption of NT by over 95% and time consumption by up to 92.67%, significantly expanding its applicability to larger graphs. Extensive experiments on 10 real‑world datasets (5 heterophilic and 5 homophilic graphs) show that NT outperforms all current state‑of‑the‑art methods on node classification tasks, demonstrating its superior performance and cross‑domain adaptability. The full implementation code of this work is publicly available at https://github.com/cf020031308/MoNT to facilitate reproducibility and industrial adoption.

Authors:Benjamin Amoh, Geoffrey Parker, Wesley Marrero
Title: Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication
Abstract:
Multi‑agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information), rather than decision quality, we introduce SeqComm‑DFL, unifying the sequential communication with decision‑focused learning for task performance. Our approach features \emphvalue‑aware message generation with sequential Stackelberg conditioning: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The \emphguidance potential determined by their prosocial ordering. We extend Optimal Model Design to communication‑augmented world models with QMIX factorization, enabling efficient end‑to‑end training via implicit differentiation. We prove information‑theoretic bounds showing that communication value scales with coordination gaps and establish \mathcalO(1/\sqrtT) convergence for the bilevel optimization, where T denotes the number of training iterations. On collaborative healthcare and StarCraft Multi‑Agent Challenge (SMAC) benchmarks, SeqComm‑DFL achieves four to six times higher cumulative rewards and over 13% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.

Authors:Taojie Zhu, Dongyang Xu, Ding Zou, Sen Zhao, Qiaobo Hao, Zhiguo Yang, Yonghong He
Title: Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning
Abstract:
Post‑training paradigms for Large Language Models (LLMs), primarily Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL), face a fundamental dilemma: SFT provides stability (low variance) but suffers from high fitting bias, while RL enables exploration (low bias) but grapples with high gradient variance. Existing unified optimization strategies often employ naive loss weighting, overlooking the statistical conflict between these distinct gradient signals. In this paper, we provide a rigorous theoretical analysis of this bias‑variance trade‑off and propose DYPO (Dynamic Policy Optimization), a unified framework designed to structurally mitigate this conflict. DYPO integrates three core components: (1) a Group Alignment Loss (GAL) that leverages intrinsic group dynamics to significantly reduce RL gradient variance; (2) a Multi‑Teacher Distillation mechanism that corrects SFT fitting bias via diverse reasoning paths; and (3) a Dynamic Exploitation‑Exploration Gating mechanism that adaptively arbitrates between stable SFT and exploratory RL based on reward feedback. Theoretical analysis confirms that DYPO linearly reduces fitting bias and minimizes overall variance. Extensive experiments demonstrate that DYPO significantly outperforms traditional sequential pipelines, achieving an average improvement of 4.8% on complex reasoning benchmarks and 13.3% on out‑of‑distribution tasks. Our code is publicly available at https://github.com/Tocci‑Zhu/DYPO.

Authors:Rafael da Silva, Jeff Eicher, Gregory Longo
Title: A Mathematical Framework for Temporal Modeling and Counterfactual Policy Simulation of Student Dropout
Abstract:
This study proposes a temporal modeling framework with a counterfactual policy‑simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time‑to‑event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class‑balanced logistic regression over person‑‑period rows. Under a late‑event temporal holdout, the model attains row‑level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest‑risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario‑indexed policy layer produces survival contrasts ΔS(T) under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch (T_\rm policy=18: 0.0102, 0.0260, 0.0819), while the mechanism‑aware branch is negative (ΔS_\rm mech(18)=‑0.0078, ΔS_\rm mech(38)=‑0.0134). A subgroup analysis by gender quantifies scenario‑induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework's capacity for internal structural scenario comparison under observational data constraints.

Authors:Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal
Title: Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety‑related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary‑Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption‑image stimuli and summarizing their activations into concept directions. We name the dataset DACO‑400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM‑SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general‑purpose capabilities.

Authors:Roi Paul
Title: Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
Abstract:
We study whether low‑rank spectral summaries of LoRA weight deltas can identify which fine‑tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre‑registered experiment on \textttLlama‑3.2‑3B‑Instruct, we manufacture 38 LoRA adapters across four categories: healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation‑steering‑derived adapters, and extract per‑layer spectral features (norms, stable rank, singular‑value entropy, effective rank, and singular‑vector cosine alignment to a healthy centroid). Within a single training method (DPO), a logistic regression classifier achieves AUC~1.00 on binary drift detection, all six pairwise objective comparisons, and near‑perfect ordinal severity ranking (ρ\geq 0.956). Principal component analysis on flattened weight deltas reveals that training objective is PC1 (AUC~1.00 for objective separation), orthogonal to training duration on PC2. Query‑projection weights detect that drift occurred; value‑projection weights identify which objective. Cross‑method generalization fails completely: a DPO‑trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC~0.00). In a behavioral evaluation phase, DPO‑inverted‑harmlessness adapters show elevated harmful compliance on HEx‑PHI prompts (mean ASR 0.266 vs.\ healthy 0.112, Δ= +0.154), with near‑perfect dose‑‑response (ρ= 0.986). The geometry‑to‑behavior rank correlation is ρ= 0.72 across 24 non‑steered adapters. These results establish that within a controlled manufacturing regime, LoRA weight‑space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross‑method monitoring requires per‑method calibration.

Authors:Mehmet Kerem Turkcan
Title: Loom: A Scalable Analytical Neural Computer Architecture
Abstract:
We present Loom, a computer architecture that executes programs compiled from C inside a looped transformer whose weights are derived analytically. The architecture implements a 22‑opcode instruction set in 8 transformer layers. Each forward pass executes one instruction; the model is applied iteratively until the program counter reaches zero. The full machine state resides in a single tensor X \in \mathbbR^d × n of fixed size, and every step has fixed cost for fixed d and n, independent of program length or execution history. The default configuration uses d = 155 and n = 1024, yielding 4.7 million parameters and 928 instruction slots. A compact configuration at d = 146 and n = 512 suffices for a 9×9 Sudoku solver (284 instructions). The weights are program‑independent: programs live in the state tensor, and the same fixed‑weight model executes any compiled program. We make Loom source code publicly available at https://github.com/mkturkcan/Loom.

Authors:Zewei Zhou, Jiajun Zou, Jiajia Zhang, Ao Yang, Ruichao He, Haozheng Zhou, Ao Liu, Jiawei Liu, Leilei Jin, Shan Shen, Daying Sun
Title: R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
Abstract:
Graph neural networks (GNNs) are increasingly applied to physical design tasks such as congestion prediction and wirelength estimation, yet progress is hindered by inconsistent circuit representations and the absence of controlled evaluation protocols. We present R2G (RTL‑to‑GDSII), a multi‑view circuit‑graph benchmark suite that standardizes five stage‑aware views with information parity (every view encodes the same attribute set, differing only in where features attach) over 30 open‑source IP cores (up to 10^6 nodes/edges). R2G provides an end‑to‑end DEF‑to‑graph pipeline spanning synthesis, placement, and routing stages, together with loaders, unified splits, domain metrics, and reproducible baselines. By decoupling representation choice from model choice, R2G isolates a confound that prior EDA and graph‑ML benchmarks leave uncontrolled. In systematic studies with GINE, GAT, and ResGatedGCN, we find: (i) view choice dominates model choice, with Test R^2 varying by more than 0.3 across representations for a fixed GNN; (ii) node‑centric views generalize best across both placement and routing; and (iii) decoder‑head depth (3‑‑4 layers) is the primary accuracy driver, turning divergent training into near‑perfect predictions (R^2>0.99). Code and datasets are available at https://github.com/ShenShan123/R2G.

Authors:Yingjie Yu, Mingyuan Wu, Ahmadreza Eslaminia, Lingzhi Zhao, Kaizhuo Yan, Klara Nahrstedt
Title: QoS-QoE Translation with Large Language Model
Abstract:
QoS‑QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user‑perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific setups and remain scattered across papers, experimental settings, and reporting formats, limiting systematic reuse, cross‑scenario generalization, and large‑scale analysis. To address this gap, we first introduce QoS‑QoE Translation dataset, a source‑grounded dataset of structured QoS‑QoE relationships from the multimedia literature, with a focus on video streaming related tasks. We construct the dataset through an automated pipeline that combines paper curation, QoS‑QoE relationship extraction, and iterative data evaluation. Each record preserves the extracted relationship together with parameter definitions, supporting evidence, and contextual metadata. We further evaluate the capability of large language models (LLMs) on QoS‑QoE translation, both before and after supervised fine‑tuning on our dataset, and show strong performance on both continuous‑value and discrete‑label prediction in bidirectional translation, from QoS‑QoE and QoE‑QoS. Our dataset provides a foundation for benchmarking LLMs in QoS‑QoE translation and for supporting future LLM‑based reasoning for multimedia quality prediction and optimization. The complete dataset and code are publicly available at https://yyu6969.github.io/qos‑qoe‑translation‑page/, for full reproducibility and open access.

Authors:Siddharth Mishra-Sharma, Tracy R. Slatyer, Yitian Sun, Yuqing Wu
Title: High-dimensional inference for the $γ$-ray sky with differentiable programming
Abstract:
We motivate the use of differentiable probabilistic programming techniques in order to account for the large model‑space inherent to astrophysical γ‑ray analyses. Targeting the longstanding Galactic Center γ‑ray Excess (GCE) puzzle, we construct differentiable forward model and likelihood that make liberal use of GPU acceleration and vectorization in order to simultaneously account for a continuum of possible spatial morphologies consistent with the GCE emission in a fully probabilistic manner. Our setup allows for efficient inference over the large model space using variational methods. Beyond application to γ‑ray data, a goal of this work is to showcase how differentiable probabilistic programming can be used as a tool to enable flexible analyses of astrophysical datasets.

Authors:Yuki Kataoka, Masahiro Banno, Michihito Kyo, Shuri Nakao, Tomoo Sato, Shunsuke Taito, Tomohiro Takayama, Takahiro Tsuge, Yasushi Tsujimoto, Ryuhei So, Toshi A. Furukawa
Title: TiAb Review Plugin: A Browser-Based Tool for AI-Assisted Title and Abstract Screening
Abstract:
Background: Server‑based screening tools impose subscription costs, while open‑source alternatives require coding skills. Objectives: We developed a browser extension that provides no‑code, serverless artificial intelligence (AI)‑assisted title and abstract screening and examined its functionality. Methods: TiAb Review Plugin is an open‑source Chrome browser extension (available at https://chromewebstore.google.com/detail/tiab‑review‑plugin/alejlnlfflogpnabpbplmnojgoeeabij). It uses Google Sheets as a shared database, requiring no dedicated server and enabling multi‑reviewer collaboration. Users supply their own Gemini API key, stored locally and encrypted. The tool offers three screening modes: manual review, large language model (LLM) batch screening, and machine learning (ML) active learning. For ML evaluation, we re‑implemented the default ASReview active learning algorithm (TF‑IDF with Naive Bayes) in TypeScript to enable in‑browser execution, and verified equivalence against the original Python implementation using 10‑fold cross‑validation on six datasets. For LLM evaluation, we compared 16 parameter configurations across two model families on a benchmark dataset, then validated the optimal configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95) with a sensitivity‑oriented prompt on five public datasets (1,038 to 5,628 records, 0.5 to 2.0 percent prevalence). Results: The TypeScript classifier produced top‑100 rankings 100 percent identical to the original ASReview across all six datasets. For LLM screening, recall was 94 to 100 percent with precision of 2 to 15 percent, and Work Saved over Sampling at 95 percent recall (WSS@95) ranged from 48.7 to 87.3 percent. Conclusions: We developed a functional browser extension that integrates LLM screening and ML active learning into a no‑code, serverless environment, ready for practical use in systematic review screening.

Authors:Brendan R. Hogan, Xiwen Chen, James T. Wilson, Kashif Rasul, Adel Boyarsky, Thomas Kamei, Anderson Schneider, Yuriy Nevmyvaka
Title: AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs
Abstract:
We present AlphaLab, an autonomous research harness that leverages frontier LLM agentic capabilities to automate the full experimental cycle in quantitative, computation‑intensive domains. Given only a dataset and a natural‑language objective, AlphaLab proceeds through three phases without human intervention: (1) it adapts to the domain and explores the data, writing analysis code and producing a research report; (2) it constructs and adversarially validates its own evaluation framework; and (3) it runs large‑scale GPU experiments via a Strategist/Worker loop, accumulating domain knowledge in a persistent playbook that functions as a form of online prompt optimization. All domain‑specific behavior is factored into adapters generated by the model itself, so the same pipeline handles qualitatively different tasks without modification. We evaluate AlphaLab with two frontier LLMs (GPT‑5.2 and Claude Opus 4.6) on three domains: CUDA kernel optimization, where it writes GPU kernels that run 4.4x faster than torch.compile on average (up to 91x); LLM pretraining, where the full system achieves 22% lower validation loss than a single‑shot baseline using the same model; and traffic forecasting, where it beats standard baselines by 23‑25% after researching and implementing published model families from the literature. The two models discover qualitatively different solutions in every domain (neither dominates uniformly), suggesting that multi‑model campaigns provide complementary search coverage. We additionally report results on financial time series forecasting in the appendix, and release all code at https://brendanhogan.github.io/alphalab‑paper/.

Authors:Gianluca Guglielmo, Marc Masana
Title: Ranked Activation Shift for Post-Hoc Out-of-Distribution Detection
Abstract:
State‑of‑the‑art post‑hoc out‑of‑distribution detection methods rely on intermediate layer activation editing. However, they exhibit inconsistent performance across datasets and models. We show that this instability is driven by differences in the activation distributions, and identify a failure mode of scaling‑based methods that arises when penultimate layer activations are not rectified. Motivated by this analysis, we propose \ours, a hyperparameter‑free post‑hoc method that replaces sorted activation magnitudes with a fixed in‑distribution reference profile. Our simple plug‑and‑play method shows strong and consistent performance across datasets and architectures without assumptions on the penultimate layer activation function, and without requiring any hyperparameter tuning, while preserving in‑distribution classification accuracy by construction. We further analyze what drives the improvement, showing that both inhibiting and exciting activation shifts independently contribute to better out‑of‑distribution discrimination.

Authors:Svetoslav Nizhnichenkov, Rahul Nair, Elizabeth Daly, Brian Mac Namee
Title: A Representation-Level Assessment of Bias Mitigation in Foundation Models
Abstract:
We investigate how successful bias mitigation reshapes the embedding space of encoder‑only and decoder‑only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias‑mitigated variants of the models. Our findings show that bias mitigation reduces gender‑occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder‑only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (https://github.com/winodec/wino‑dec)

Authors:Mu Nan, Muquan Yu, Weijian Mai, Jacob S. Prince, Hossein Adeli, Rui Zhang, Jiahang Cao, Benjamin Becker, John A. Pyles, Margaret M. Henderson, Chunfeng Song, Nikolaus Kriegeskorte, Michael J. Tarr, Xiaoqing Hu, Andrew F. Luo
Title: Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
Abstract:
Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field‑wide goal is to achieve generalizable, cross‑subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine‑tuning separately for each subject. To address this challenge, we introduce a meta‑optimized approach for semantic visual decoding from fMRI that generalizes to novel subjects without any fine‑tuning. By simply conditioning on a small set of image‑brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in‑context learning of the new subject's encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per‑voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross‑subject and cross‑scanner generalization across diverse visual backbones without retraining or fine‑tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non‑invasive brain decoding.

Authors:Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen, Jinyuan Jia
Title: PIArena: A Platform for Prompt Injection Evaluation
Abstract:
Prompt injection attacks pose serious security risks across a wide range of real‑world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified platform for prompt injection evaluation. This makes it challenging to reliably compare defenses, understand their true robustness under diverse attacks, or assess how well they generalize across tasks and benchmarks. For instance, many defenses initially reported as effective were later found to exhibit limited robustness on diverse datasets and attacks. To bridge this gap, we introduce PIArena, a unified and extensible platform for prompt injection evaluation that enables users to easily integrate state‑of‑the‑art attacks and defenses and evaluate them across a variety of existing and new benchmarks. We also design a dynamic strategy‑based attack that adaptively optimizes injected prompts based on defense feedback. Through comprehensive evaluation using PIArena, we uncover critical limitations of state‑of‑the‑art defenses: limited generalizability across tasks, vulnerability to adaptive attacks, and fundamental challenges when an injected task aligns with the target task. The code and datasets are available at https://github.com/sleeepeer/PIArena.

Authors:Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
Title: SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high‑quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction‑tuning datasets containing expert‑annotated ground‑truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non‑trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU‑Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human‑annotated resources to extend RLVR to general reasoning. The code and data is available at https://github.com/asuvarna31/supernova.

Authors:Abdelkarim Loukili
Title: Quantization Impact on the Accuracy and Communication Efficiency Trade-off in Federated Learning for Aerospace Predictive Maintenance
Abstract:
Federated learning (FL) enables privacy‑preserving predictive maintenance across distributed aerospace fleets, but gradient communication overhead constrains deployment on bandwidth‑limited IoT nodes. This paper investigates the impact of symmetric uniform quantization (b \in \32,8,4,2\ bits) on the accuracy‑‑efficiency trade‑off of a custom‑designed lightweight 1‑D convolutional model (AeroConv1D, 9\,697 parameters) trained via FL on the NASA C‑MAPSS benchmark under a realistic Non‑IID client partition. Using a rigorous multi‑seed evaluation (N=10 seeds), we show that INT4 achieves accuracy \emphstatistically indistinguishable from FP32 on both FD001 (p=0.341) and FD002 (p=0.264 MAE, p=0.534 NASA score) while delivering an 8× reduction in gradient communication cost (37.88~KiB \to 4.73~KiB per round). A key methodological finding is that naïve IID client partitioning artificially suppresses variance; correct Non‑IID evaluation reveals the true operational instability of extreme quantization, demonstrated via a direct empirical IID vs.\ Non‑IID comparison. INT2 is empirically characterized as unsuitable: while it achieves lower MAE on FD002 through extreme quantization‑induced over‑regularization, this apparent gain is accompanied by catastrophic NASA score instability (CV\,=\,45.8% vs.\ 22.3% for FP32), confirming non‑reproducibility under heterogeneous operating conditions. Analytical FPGA resource projections on the Xilinx ZCU102 confirm that INT4 fits within hardware constraints (85.5% DSP utilization), potentially enabling a complete FL pipeline on a single SoC. The full simulation codebase and FPGA estimation scripts are publicly available at https://github.com/therealdeadbeef/aerospace‑fl‑quantization.

Authors:Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
Title: DMax: Aggressive Parallel Decoding for dLLMs
Abstract:
We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask‑to‑token transition, DMax reformulates decoding as a progressive self‑refinement from mask embeddings to token embeddings. At the core of our approach is On‑Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self‑revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA‑2.0‑mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax

Authors:Yunxiang Peng, Mengmeng Ma, Ziyu Yao, Xi Peng
Title: Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Abstract:
Reliable generalization metrics are fundamental to the evaluation of machine learning models. Especially in high‑stakes applications where labeled target data are scarce, evaluation of models' generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable and label‑free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy‑on‑the‑line, are often unreliable as they only assess model output while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using the inner workings of a model, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models' generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model's generalization under different distribution shifts. Across various tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 13.4% and 34.1%, respectively. Our code is available at https://github.com/deep‑real/GenCircuit.

Authors:Anders S. Olsen, Miriam L. Navarro, Claus Svarer, Jesper L. Hinrich, Morten Mørup, Gitte M. Knudsen
Title: Shift- and stretch-invariant non-negative matrix factorization with an application to brain tissue delineation in emission tomography data
Abstract:
Dynamic neuroimaging data, such as emission tomography measurements of radiotracer transport in blood or cerebrospinal fluid, often exhibit diffusion‑like properties. These introduce distance‑dependent temporal delays, scale‑differences, and stretching effects that limit the effectiveness of conventional linear modeling and decomposition methods. To address this, we present the shift‑ and stretch‑invariant non‑negative matrix factorization framework. Our approach estimates both integer and non‑integer temporal shifts as well as temporal stretching, all implemented in the frequency domain, where shifts correspond to phase modifications, and where stretching is handled via zero‑padding or truncation. The model is implemented in PyTorch (https://github.com/anders‑s‑olsen/shiftstretchNMF). We demonstrate on synthetic data and brain emission tomography data that the model is able to account for stretching to provide more detailed characterization of brain tissue structure.

Authors:Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
Title: Small Vision-Language Models are Smart Compressors for Long Video Understanding
Abstract:
Adapting Multimodal Large Language Models (MLLMs) for hour‑long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost‑in‑the‑middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query‑aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision‑Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross‑modal distillation process to generate compact, intent‑aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero‑shot relevance prior and semantic front‑loading, ATA acts as a training‑free O(1) dynamic router. It allocates dense bandwidth to query‑critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state‑of‑the‑art performance with aggressive dynamic compression (0.5‑16 tokens/frame). On the extreme‑long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT‑4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour‑long videos substantially below theoretical limits, proving true long‑form video understanding relies on intent‑driven efficiency rather than greedily padded context windows.

Authors:Soumya Mazumdar, Vineet Kumar Rakesh, Tapas Samanta
Title: PrivFedTalk: Privacy-Aware Federated Diffusion with Identity-Stable Adapters for Personalized Talking-Head Generation
Abstract:
Talking‑head generation has advanced rapidly with diffusion‑based generative models, but training usually depends on centralized face‑video and speech datasets, raising major privacy concerns. The problem is more acute for personalized talking‑head generation, where identity‑specific data are highly sensitive and often cannot be pooled across users or devices. PrivFedTalk is presented as a privacy‑aware federated framework for personalized talking‑head generation that combines conditional latent diffusion with parameter‑efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio‑visual data, avoiding raw data sharing and reducing communication cost. To address heterogeneous client distributions, Identity‑Stable Federated Aggregation (ISFA) weights client updates using privacy‑safe scalar reliability signals computed from on‑device identity consistency and temporal stability estimates. Temporal‑Denoising Consistency (TDC) regularization is introduced to reduce inter‑frame drift, flicker, and identity drift during federated denoising. To limit update‑side privacy risk, secure aggregation and client‑level differential privacy are applied to adapter updates. The implementation supports both low‑memory GPU execution and multi‑GPU client‑parallel training on heterogeneous shared hardware. Comparative experiments on the present setup across multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx show stable federated optimization and successful end‑to‑end training and evaluation under constrained resources. The results support the feasibility of privacy‑aware personalized talking‑head training in federated environments, while suggesting that stronger component‑wise, privacy‑utility, and qualitative claims need further standardized evaluation.

Authors:Minh Sao Khue Luu, Evgeniy N. Pavlovskiy, Bair N. Tuchinov
Title: Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI
Abstract:
We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component‑Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion‑level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU‑Net loss to jointly optimize voxel‑level segmentation accuracy and lesion‑level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU‑Net framework and 5‑fold cross‑validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component‑level and lesion‑level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at \hrefhttps://github.com/luumsk/SmallLesionMRIthis url.

Authors:Felix Embacher, Jonas Uhrig, Marius Cordts, Markus Enzweiler
Title: SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
Abstract:
Retrieving rare and safety‑critical driving scenarios from large‑scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large‑scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high‑quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle‑in‑a‑haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance‑level retrieval, SearchAD emphasizes semantic image retrieval with a well‑defined data split, enabling text‑to‑image and image‑to‑image retrieval, few‑shot learning, and fine‑tuning of multi‑modal retrieval models. Comprehensive evaluations show that text‑based methods outperform image‑based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero‑shot results, and our fine‑tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held‑out test set on a public benchmark server, SearchAD establishes the first large‑scale dataset for retrieval‑driven data curation and long‑tail perception research in AD: https://iis‑esslingen.github.io/searchad/

Authors:Shuaiting Li, Juncan Deng, Kedong Xu, Rongtao Deng, Hong Gu, Minghan Jiang, Haibin Shen, Kejie Huang
Title: Rethinking Residual Errors in Compensation-based LLM Quantization
Abstract:
Methods based on weight compensation, which iteratively apply quantization and weight compensation to minimize the output error, have recently demonstrated remarkable success in quantizing Large Language Models (LLMs). The representative work, GPTQ, introduces several key techniques that make such iterative methods practical for LLMs with billions of parameters. GPTAQ extends this approach by introducing an asymmetric calibration process that aligns the output of each quantized layer with its full‑precision counterpart, incorporating a residual error into the weight compensation framework. In this work, we revisit the formulation of the residual error. We identify a sub‑optimal calibration objective in existing methods: during the intra‑layer calibration process, they align the quantized output with the output from compensated weights, rather than the true output from the original full‑precision model. Therefore, we redefine the objective to precisely align the quantized model's output with the original output of the full‑precision model at each step. We then reveal that the residual error originates not only from the output difference of the preceding layer but also from the discrepancy between the compensated and original weights within each layer, which we name the 'compensation‑aware error'. By inheriting the neuron decomposition technique from GPTAQ, we can efficiently incorporate this compensation‑aware error into the weight update process. Extensive experiments on various LLMs and quantization settings demonstrate that our proposed enhancements integrate seamlessly with both GPTQ and GPTAQ, significantly improving their quantization performance. Our code is publicly available at https://github.com/list0830/ResComp.

Authors:Tao Han, Zhibin Wen, Zhenghao Chen, Fenghua Lin, Junyu Gao, Song Guo, Lei Bai
Title: Generative 3D Gaussian Splatting for Arbitrary-ResolutionAtmospheric Downscaling and Forecasting
Abstract:
While AI‑based numerical weather prediction (NWP) enables rapid forecasting, generating high‑resolution outputs remains computationally demanding due to limited multi‑scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting‑based scale‑aware vision transformer (GSSA‑ViT), a novel framework for arbitrary‑resolution forecasting and flexible downscaling of high‑dimensional atmospheric fields. Specifically, latitude‑longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale‑aware attention module is designed to capture cross‑scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale‑aware attention for unified multi‑scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high‑resolution, multi‑scale atmospheric prediction and downscaling. Code is available at: https://github.com/binbin2xs/weather‑GS.

Authors:Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
Title: Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Abstract:
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label‑based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio‑based methods can leverage emotionally rich speech signals ‑ and even benefit from expressive text‑to‑speech (TTS) synthesis ‑ but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speeches. Images‑based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high‑quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross‑Modal Emotion Transfer (C‑MET), a novel approach that generates facial expressions based on speeches by modeling emotion semantic vectors between speech and visual feature spaces. C‑MET leverages a large‑scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA‑D datasets demonstrate that our method improves emotion accuracy by 14% over state‑of‑the‑art methods, while generating expressive talking face videos ‑ even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok‑choi.github.io/C‑MET/

Authors:Yasong Fan
Title: Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall
Abstract:
We present FDM (Fan Duality Model), a linear sequence architecture that resolves the fundamental tension between memory efficiency and associative recall in sequence modeling. FDM separates sequence processing into two components: a wave component (recurrent scan via phase‑preserving Givens rotations) that compresses long‑range patterns into a fixed‑size complex hidden state, and a particle component (local‑global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128‑8,192 tokens, versus Transformer's 853‑4,247 MB (4.9x reduction at N=8,192). Beyond the architecture, we discover that jointly training the wave and particle components leads to suboptimal convergence. We propose Freeze‑Scan, a two‑phase training strategy that freezes the recurrent scan and optimizes the cache jointly with embeddings, achieving PPL=64.9 on WikiText‑103 in 44K steps ‑‑ a 7.5x improvement over full fine‑tuning (PPL=487). On Multi‑Query Associative Recall (MQAR), FDM achieves 0.966 accuracy, surpassing Transformer (0.606) by 59.5%, while pure scan without cache scores only 0.011, confirming the necessity of the particle component. Finally, we introduce Holographic Reference Beam Decoding, interpreting the complex hidden state h_t as a holographic plate encoding the entire temporal history. Using the current input x_t as a reference beam to modulate h_t reduces PPL by up to 2.13 points (PPL=62.79) with a 4‑head orthogonal reference beam using only 1.3M additional parameters, providing empirical support for the holographic interpretation. Code and pretrained weights: https://github.com/YasongFan/FDM

Authors:David Gringras
Title: IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
Abstract:
Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre‑registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0‑3; omission harm, OH 0‑4) through a structured‑evaluation pipeline validated against physician scoring (kappa_w = 0.571, within‑1 agreement 96%). The central finding is identity‑contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety‑colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001, while non‑colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT‑5.2, whose post‑generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH >= 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.

Authors:Yang Cao
Title: Optimal Decay Spectra for Linear Recurrences
Abstract:
Linear recurrent models offer linear‑time sequence processing but often suffer from suboptimal long‑range memory. We trace this to the decay spectrum: for N channels, random initialization collapses the minimum spectral gap to O(N^‑2), yielding sub‑exponential error \exp(‑Ω(N/\log N)); linear spacing avoids collapse but degrades to \exp(‑O(N/\sqrtT)), practically algebraic over long contexts. We introduce Position‑Adaptive Spectral Tapering (PoST), an architecture‑agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log‑decay rates, proven minimax optimal at rate O(\exp(‑cN/\log T)); and (2) Position‑Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only N\log t/\log T of N channels are effective at position t) by stretching the spectrum to the actual dependency range, sharpening the rate to O(\exp(‑cN/\log t)). This scaling natively induces fractional invariance: the impulse response becomes scale‑free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba‑2, RWKV‑7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre‑training at 180M‑440M scales shows consistent zero‑shot language modeling improvements, significant long‑context retrieval gains for Mamba‑2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: https://github.com/SiLifen/PoST.

Authors:Haimeng Zhao, Alexander Zlokapa, Hartmut Neven, Ryan Babbush, John Preskill, Jarrod R. McClean, Hsin-Yuan Huang
Title: Exponential quantum advantage in processing massive classical data
Abstract:
Broadly applicable quantum advantage, particularly in classical data processing and machine learning, has been a fundamental open problem. In this work, we prove that a small quantum computer of polylogarithmic size can perform large‑scale classification and dimension reduction on massive classical data by processing samples on the fly, whereas any classical machine achieving the same prediction performance requires exponentially larger size. Furthermore, classical machines that are exponentially larger yet below the required size need superpolynomially more samples and time. We validate these quantum advantages in real‑world applications, including single‑cell RNA sequencing and movie review sentiment analysis, demonstrating four to six orders of magnitude reduction in size with fewer than 60 logical qubits. These quantum advantages are enabled by quantum oracle sketching, an algorithm for accessing the classical world in quantum superposition using only random classical data samples. Combined with classical shadows, our algorithm circumvents the data loading and readout bottleneck to construct succinct classical models from massive classical data, a task provably impossible for any classical machine that is not exponentially larger than the quantum machine. These quantum advantages persist even when classical machines are granted unlimited time or if BPP=BQP, and rely only on the correctness of quantum mechanics. Together, our results establish machine learning on classical data as a broad and natural domain of quantum advantage and a fundamental test of quantum mechanics at the complexity frontier.

Authors:Ziyi Wang, Siva Rajesh Kasa, Ankith M S, Santhosh Kumar Kasa, Jiaru Zou, Sumit Negi, Ruqi Zhang, Nan Jiang, Qifan Song
Title: DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification
Abstract:
Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble‑based verifier that blends the draft and target model distributions with a task‑dependent and context‑dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.

Authors:Michael Cuccarese
Title: Predicting Activity Cliffs for Autonomous Medicinal Chemistry
Abstract:
Activity cliff prediction ‑ identifying positions where small structural changes cause large potency shifts ‑ has been a persistent challenge in computational medicinal chemistry. This work focuses on a parsimonious definition: which small modifications, at which positions, confer the highest probability of an outcome change. Position‑level sensitivity is calculated using 25 million matched molecular pairs from 50 ChEMBL targets across six protein families, revealing that two questions have fundamentally different answers. "Which positions vary most?" is answered by scaffold size alone (NDCG@3 = 0.966), requiring no machine learning. "Which are true activity cliffs?" ‑ where small modifications cause disproportionately large effects, as captured by SALI normalization ‑ requires an 11‑feature model with 3D pharmacophore context (NDCG@3 = 0.910 vs. 0.839 random), generalizing across all six protein families, novel scaffolds (0.913), and temporal splits (0.878). The model identifies the cliff‑prone position first 53% of the time (vs. 27% random ‑ 2x lift), reducing positions a chemist must explore from 3.1 to 2.1 ‑ a 31% reduction in first‑round experiments. Predicting which modification to make is not tractable from structure alone (Spearman 0.268, collapsing to ‑0.31 on novel scaffolds). The system is released as open‑source code and an interactive webapp.

Authors:Ziyang Cheng, Haoyu Wei, Hang Yin, Xiuwei Xu, Bingyao Yu, Jie Zhou, Jiwen Lu
Title: CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
Abstract:
While decoupled control schemes for legged mobile manipulators have shown robustness, learning holistic whole‑body control policies for tracking global end‑effector poses remains fragile against Out‑of‑Distribution (OOD) inputs induced by sensor noise or infeasible user commands. To improve robustness against these perturbations without sacrificing task performance and continuity, we propose Competence Manifold Projection (CMP). Specifically, we utilize a Frame‑Wise Safety Scheme that transforms the infinite‑horizon safety constraint into a computationally efficient single‑step manifold inclusion. To instantiate this competence manifold, we employ a Lower‑Bounded Safety Estimator that distinguishes unmastered intentions from the training distribution. We then introduce an Isomorphic Latent Space (ILS) that aligns manifold geometry with safety probability, enabling efficient O(1) seamless defense against arbitrary OOD intents. Experiments demonstrate that CMP achieves up to a 10‑fold survival rate improvement in typical OOD scenarios where baselines suffer catastrophic failure, incurring under 10% tracking degradation. Notably, the system exhibits emergent ``best‑effort'' generalization behaviors to progressively accomplish OOD goals by adhering to the competence boundaries. Result videos are available at: https://shepherd1226.github.io/CMP.

Authors:Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao, Qixin Zhang, Xikun Zhang, Chao Zhang, Guanzhi Deng, Alex Xue, Juan Du, Tianshu Yu, Garth Tarr, Linqi Song, Qiuzhuang Sun, Dacheng Tao
Title: FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
Abstract:
The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real‑world manufacturing environments. Progress is hindered by data scarcity and a lack of fine‑grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. Wefirst construct a high‑quality multimodal dataset that combines real‑world 2D images and 3D point clouds, annotated with fine‑grained domain semantics (e.g., exact model numbers). We then evaluate 18 state‑of‑the‑art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain‑specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine‑tuning of a compact 3B‑parameter model on our data yields up to 90.8% relative improvement in accuracy on held‑out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain‑adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge‑web.

Authors:Daniel Nobrega Medeiros
Title: Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
Abstract:
Why does gradient descent reliably find good solutions in non‑convex neural network optimization, despite the landscape being NP‑hard in the worst case? We show that gradient flow on L‑layer ReLU networks without bias preserves L‑1 conservation laws C_l = ||W_l+1||_F^2 ‑ ||W_l||_F^2, confining trajectories to lower‑dimensional manifolds. Under discrete gradient descent, these laws break with total drift scaling as eta^alpha where alpha is approximately 1.1‑1.6 depending on architecture, loss function, and width. We decompose this drift exactly as eta^2 S(eta), where the gradient imbalance sum S(eta) admits a closed‑form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 lambda_x,k^2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross‑entropy loss, softmax probability concentration drives exponential Hessian spectral compression with timescale tau = Theta(1/eta) independent of training set size, explaining why cross‑entropy self‑regularizes the drift exponent near alpha=1.0. We identify two dynamical regimes separated by a width‑dependent transition: a perturbative sub‑Edge‑of‑Stability regime where the spectral formula applies, and a non‑perturbative regime with extensive mode coupling. All predictions are validated across 23 experiments.

Authors:Wonseon Lim, Jaesung Lee, Dae-Won Kim
Title: Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Abstract:
Continual learning (CL) on edge devices requires not only high accuracy but also training‑time efficiency to support on‑device adaptation under strict memory and computational constraints. While prompt‑based continual learning (PCL) is parameter‑efficient and achieves competitive accuracy, prior work has focused mainly on accuracy or inference‑time performance, often overlooking the memory and computational costs of on‑device training. In this paper, we propose CPS‑Prompt, a critical patch‑aware sparse prompting framework that explicitly targets training‑time memory usage and computational cost by integrating critical patch sampling (CPS) for task‑aware token reduction and decoupled prompt and classifier training (DPCT) to reduce backpropagation overhead. Experiments on three public benchmarks and real edge hardware show that CPS‑Prompt improves peak memory, training time, and energy efficiency by about 1.6x over the balanced CODA‑Prompt baseline, while maintaining accuracy within 2% of the state‑of‑the‑art C‑Prompt on average and remaining competitive with CODA‑Prompt in accuracy. The code is available at https://github.com/laymond1/cps‑prompt.

Authors:David Golchinfar, Daryoush Vaziri, Alexander Marquardt
Title: Playing DOOM with 1.3M Parameters: Specialized Small Models vs Large Language Models for Real-Time Game Control
Abstract:
We present SauerkrautLM‑Doom‑MultiVec, a 1.3 million parameter model that plays the classic first‑person shooter DOOM in real time, outperforming large language models up to 92,000x its size, including Nemotron‑120B, Qwen3.5‑27B, and GPT‑4o‑mini. Our model combines a ModernBERT encoder with hash embeddings, depth‑aware token representations, and an attention pooling classification head to select game actions from ASCII frame representations at 31ms per decision. Trained on just 31,000 human gameplay demonstrations, it achieves 178 frags in 10 episodes (17.8 per episode) in the defend_the_center scenario, more than all tested LLMs combined (13 frags total). All agents receive equivalent input: ASCII frames and depth maps. Despite having 92,000x fewer parameters than Nemotron‑120B, our model is the only agent that actively engages enemies rather than purely evading them. These results demonstrate that small, task‑specific models trained on domain‑appropriate data can decisively outperform general‑purpose LLMs at real‑time control tasks, at a fraction of the inference cost, with deployment capability on consumer hardware.

Authors:Rui Dong, Zitong Wang, Jiaxing Li, Weihuang Zheng, Youyong Kong
Title: BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
Abstract:
Graph Neural Networks (GNNs) have been widely used in diverse brain network analysis tasks based on preprocessed functional magnetic resonance imaging (fMRI) data. However, their performances are constrained due to high feature sparsity and inherent limitations of domain knowledge within uni‑modal neurographs. Meanwhile, large language models (LLMs) have demonstrated powerful representation capabilities. Combining LLMs with GNNs presents a promising direction for brain network analysis. While LLMs and MLLMs have emerged in neuroscience, integration of LLMs with graph‑based data remains unexplored. In this work, we deal with these issues by incorporating LLM's powerful representation and generalization capabilities. Considering great cost for directly tuning LLMs, we instead function LLM as enhancer to boost GNN's performance on downstream tasks. Our method, namely BLEG, can be divided into three stages. We firstly prompt LLM to get augmented texts for fMRI graph data, then we design a LLM‑LM instruction tuning method to get enhanced textual representations at a relatively lower cost. GNN is trained together for coarsened alignment. Finally we finetune an adapter after GNN for given downstream tasks. Alignment loss between LM and GNN logits is designed to further enhance GNN's representation. Extensive experiments on different datasets confirmed BLEG's superiority.Code can be available at https://github.com/KamonRiderDR/BLEG.

Authors:Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Kai Wang, Zheng Wang, Peng Hu, Xi Peng, Hongyuan Zhu
Title: Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment
Abstract:
Existing dynamic data pruning methods often fail under noisy‑label settings, as they typically rely on per‑sample loss as the ranking criterion. This could mistakenly lead to preserving noisy samples due to their high loss values, resulting in significant performance drop. To address this, we propose AlignPrune, a noise‑robust module designed to enhance the reliability of dynamic pruning under label noise. Specifically, AlignPrune introduces the Dynamic Alignment Score (DAS), which is a loss‑trajectory‑based criterion that enables more accurate identification of noisy samples, thereby improving pruning effectiveness. As a simple yet effective plug‑and‑play module, AlignPrune can be seamlessly integrated into state‑of‑the‑art dynamic pruning frameworks, consistently outperforming them without modifying either the model architecture or the training pipeline. Extensive experiments on five widely‑used benchmarks across various noise types and pruning ratios demonstrate the effectiveness of AlignPrune, boosting accuracy by up to 6.3% over state‑of‑the‑art baselines. Our results offer a generalizable solution for pruning under noisy data, encouraging further exploration of learning in real‑world scenarios. Code is available at: https://github.com/leonqin430/AlignPrune.

Authors:Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu, Gang Wang, Wentong Cai
Title: Efficient Learned Data Compression via Dual-Stream Feature Decoupling
Abstract:
While Learned Data Compression (LDC) has achieved superior compression ratios, balancing precise probability modeling with system efficiency remains challenging. Crucially, uniform single‑stream architectures struggle to simultaneously capture micro‑syntactic and macro‑semantic features, necessitating deep serial stacking that exacerbates latency. Compounding this, heterogeneous systems are constrained by device speed mismatches, where throughput is capped by Amdahl's Law due to serial processing. To this end, we propose a Dual‑Stream Multi‑Scale Decoupler that disentangles local and global contexts to replace deep serial processing with shallow parallel streams, and incorporate a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling. Furthermore, we design a Concurrent Stream‑Parallel Pipeline, which overcomes systemic bottlenecks to achieve full‑pipeline parallelism. Extensive experiments demonstrate that our method achieves state‑of‑the‑art performance in both compression ratio and throughput, while maintaining the lowest latency and memory usage. The code is available at https://github.com/huidong‑ma/FADE.

Authors:Ricardo Knauer, Andre Beinrucker, Erik Rodner
Title: ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations
Abstract:
Neural networks deliver impressive predictive performance across a variety of tasks, but they are often opaque in their decision‑making processes. Despite a growing interest in mechanistic interpretability, tools for systematically exploring the representations learned by neural networks in general, and tabular foundation models in particular, remain limited. In this work, we introduce ConceptTracer, an interactive application for analyzing neural representations through the lens of human‑interpretable concepts. ConceptTracer integrates two information‑theoretic measures that quantify concept saliency and selectivity, enabling researchers and practitioners to identify neurons that respond strongly to individual concepts. We demonstrate the utility of ConceptTracer on representations learned by TabPFN and show that our approach facilitates the discovery of interpretable neurons. Together, these capabilities provide a practical framework for investigating how neural networks like TabPFN encode concept‑level information. ConceptTracer is available at https://github.com/ml‑lab‑htw/concept‑tracer.

Authors:Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang
Title: MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization
Abstract:
Mixture‑of‑Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE‑specific issues, including cross‑expert redundancy, task‑agnostic importance estimation, and quantization‑induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE‑based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross‑expert redundancy; 2. integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state‑of‑the‑art binary methods across multiple MoE‑based LLMs and benchmarks. For example, on Qwen3‑30B‑A3B, MoBiE reduces perplexity by 52.2%, improves average zero‑shot performance by 43.4%, achieves over 2 × inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon‑zzx/MoBiE.

Authors:Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Qi Long, Li Shen
Title: Bi-Lipschitz Autoencoder With Injectivity Guarantee
Abstract:
Autoencoders are widely used for dimensionality reduction, based on the assumption that high‑dimensional data lies on low‑dimensional manifolds. Regularized autoencoders aim to preserve manifold geometry during dimensionality reduction, but existing approaches often suffer from non‑injective mappings and overly rigid constraints that limit their effectiveness and robustness. In this work, we identify encoder non‑injectivity as a core bottleneck that leads to poor convergence and distorted latent representations. To ensure robustness across data distributions, we formalize the concept of admissible regularization and provide sufficient conditions for its satisfaction. In this work, we propose the Bi‑Lipschitz Autoencoder (BLAE), which introduces two key innovations: (1) an injective regularization scheme based on a separation criterion to eliminate pathological local minima, and (2) a bi‑Lipschitz relaxation that preserves geometry and exhibits robustness to data distribution drift. Empirical results on diverse datasets show that BLAE consistently outperforms existing methods in preserving manifold structure while remaining resilient to sampling sparsity and distribution shifts. Code is available at https://github.com/qipengz/BLAE.

Authors:Yue Fang, Weibin Liao, Yuxin Guo, Jiaran Gao, Hongxin Ding, Jinyang Zhang, Xinke Jiang, Zhibang Yang, Junfeng Zhao, Yasha Wang, Liantao Ma
Title: GraphWalker: Graph-Guided In-Context Learning for Clinical Reasoning on Electronic Health Records
Abstract:
Clinical Reasoning on Electronic Health Records (EHRs) is a fundamental yet challenging task in modern healthcare. While in‑context learning (ICL) offers a promising inference‑time adaptation paradigm for large language models (LLMs) in EHR reasoning, existing methods face three fundamental challenges: (1) Perspective Limitation, where data‑driven similarity fails to align with LLM reasoning needs and model‑driven signals are constrained by limited clinical competence; (2) Cohort Awareness, as demonstrations are selected independently without modeling population‑level structure; and (3) Information Aggregation, where redundancy and interaction effects among demonstrations are ignored, leading to diminishing marginal gains. To address these challenges, we propose GraphWalker, a principled demonstration selection framework for EHR‑oriented ICL. GraphWalker (i) jointly models patient clinical information and LLM‑estimated information gain by integrating data‑driven and model‑driven perspectives, (ii) incorporates Cohort Discovery to avoid noisy local optima, and (iii) employs a Lazy Greedy Search with Frontier Expansion algorithm to mitigate diminishing marginal returns in information aggregation. Extensive experiments on multiple real‑world EHR benchmarks demonstrate that GraphWalker consistently outperforms state‑of‑the‑art ICL baselines, yielding substantial improvements in clinical reasoning performance. Our code is open‑sourced at https://github.com/PuppyKnightUniversity/GraphWalker

Authors:Hanyang Wang, Mingxuan Zhu
Title: The Detection-Extraction Gap: Models Know the Answer Before They Can Say It
Abstract:
Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52‑‑88% of chain‑of‑thought tokens are produced after the answer is recoverable from a partial prefix. This post‑commitment generation reveals a structural phenomenon: the detection‑extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt‑conditioned decoding fails to extract it. We formalize this mismatch via a total‑variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix‑induced shift. Exploiting this asymmetry, we propose Black‑box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70‑‑78% of serial generation while improving accuracy by 1‑‑5pp across all models. For thinking‑mode models, early exit prevents post‑commitment overwriting, yielding gains of up to 5.8pp; a cost‑optimized variant achieves 68‑‑73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.

Authors:Peigui Qi, Kunsheng Tang, Yanpu Yu, Jialin Wu, Yide Song, Wenbo Zhou, Zhicong Huang, Cheng Hong, Weiming Zhang, Nenghai Yu
Title: VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts
Abstract:
Vision‑Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses suffer from efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework that enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE‑extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug‑and‑play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment. Code is available at [this https URL](https://github.com/pgqihere/VLMShield).

Authors:Dev Arpan Desai, Shaoyi Huang, Zining Zhu
Title: Distributed Interpretability and Control for Large Language Models
Abstract:
Large language models that require multiple GPU cards to host are usually the most capable models. It is necessary to understand and steer these models, but the current technologies do not support the interpretability and steering of these models in the multi‑GPU setting as well as the single‑GPU setting. We present a practical implementation of activation‑level interpretability (logit lens) and steering (steering vector) that scales up to multi‑GPU language models. Our system implements design choices that reduce the activation memory by up to 7x and increase the throughput by up to 41x compared to a baseline on identical hardware. We demonstrate the method across LLaMA‑3.1 (8B, 70B) and Qwen‑3 (4B, 14B, 32B), sustaining 20‑100 tokens/s while collecting full layer‑wise activation trajectories for sequences of 1,500 tokens. Using label‑position steering vectors injected post‑LayerNorm, we show controllable, monotonic shifts in model outputs with a mean steerability slope of 0.702 across evaluated datasets, without fine‑tuning or additional forward passes. We release detailed benchmarks, ablations, and a reproducible instrumentation recipe to enable practical interpretability and real‑time behavioral control for frontier LLMs at https://github.com/Devdesai1901/LogitLense.

Authors:Mingchen Zhuge, Changsheng Zhao, Haozhe Liu, Zijian Zhou, Shuming Liu, Wenyi Wang, Ernie Chang, Gael Le Lan, Junjie Fei, Wenxuan Zhang, Yasheng Sun, Zhipeng Cai, Zechun Liu, Yunyang Xiong, Yining Yang, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
Title: Neural Computers
Abstract:
We propose a new frontier: Neural Computers (NCs) that unify computation, memory, and I/O of traditional computers in a learned runtime state. Our long‑term goal is the Completely Neural Computer (CNC): the mature, general‑purpose realization of this emerging machine form, with stable execution, explicit reprogramming, and durable capability reuse. As an initial step, we study whether elementary NC primitives can be learned solely from collected I/O traces, without instrumented program state. Concretely, we instantiate NCs as video models that roll out screen frames from instructions, pixels, and user actions (when available) in CLI and GUI settings. We show that NCs can acquire elementary interface primitives, especially I/O alignment and short‑horizon control, while routine reuse, controlled updates, and symbolic stability remain challenging. We outline a roadmap toward CNCs, to establish a new computing paradigm beyond today's agents and conventional computers.

Authors:Wenyue Hua, Sripad Karne, Qian Xie, Armaan Agrawal, Nikos Pagonas, Kostis Kaffes, Tianyi Peng
Title: AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
Abstract:
AI agents are increasingly deployed in real‑world applications, including systems such as Manus, OpenClaw, and coding agents. Existing research has primarily focused on server‑side efficiency, proposing methods such as caching, speculative execution, traffic scheduling, and load balancing to reduce the cost of serving agentic workloads. However, as users increasingly construct agents by composing local tools, remote APIs, and diverse models, an equally important optimization problem arises on the client side. Client‑side optimization asks how developers should allocate the resources available to them, including model choice, local tools, and API budget across pipeline stages, subject to application‑specific quality, cost, and latency constraints. Because these objectives depend on the task and deployment setting, they cannot be determined by server‑side systems alone. We introduce AgentOpt, the first framework‑agnostic Python package for client‑side agent optimization. We first study model selection, a high‑impact optimization lever in multi‑step agent pipelines. Given a pipeline and a small evaluation set, the goal is to find the most cost‑effective assignment of models to pipeline roles. This problem is consequential in practice: at matched accuracy, the cost gap between the best and worst model combinations can reach 13‑32x in our experiments. To efficiently explore the exponentially growing combination space, AgentOpt implements ten search algorithms, including UCB‑E, UCB‑E with Low‑Rank Factorization, Arm Elimination, Epsilon‑LUCB, Threshold Successive Elimination, and Bayesian Optimization. Across four benchmarks, UCB‑E recovers near‑optimal accuracy while reducing evaluation budget by 62‑76% relative to brute‑force search. Code and benchmark results available at https://agentoptimizer.github.io/agentopt/.

Authors:Lin Mu, Haiyang Wang, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang
Title: TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models
Abstract:
Low‑Rank Adaptation (LoRA) enables parameter‑efficient fine‑tuning of Large Language Models (LLMs), and recent Mixture‑of‑Experts (MoE) extensions further enhance flexibility by dynamically combining multiple LoRA experts. However, existing MoE‑augmented LoRA methods assume that experts operate independently, often leading to unstable routing, expert dominance. In this paper, we propose TalkLoRA, a communication‑aware MoELoRA framework that relaxes this independence assumption by introducing expert‑level communication prior to routing. TalkLoRA equips low‑rank experts with a lightweight Talking Module that enables controlled information exchange across expert subspaces, producing a more robust global signal for routing. Theoretically, we show that expert communication smooths routing dynamics by mitigating perturbation amplification while strictly generalizing existing MoELoRA architectures. Empirically, TalkLoRA consistently outperforms vanilla LoRA and MoELoRA across diverse language understanding and generation tasks, achieving higher parameter efficiency and more balanced expert routing under comparable parameter budgets. These results highlight structured expert communication as a principled and effective enhancement for MoE‑based parameter‑efficient adaptation. Code is available at https://github.com/why0129/TalkLoRA.

Authors:Ashmal Vayani, Parth Parag Kulkarni, Joseph Fioresi, Song Wang, Mubarak Shah
Title: MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis
Abstract:
Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to capability of these models in providing precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and unsuitable under the wide range of medical conditions in real‑world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain‑specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi‑agent LMM framework, where each agent functions as a medical specialist. Current approaches in this emerging area, typically relying on static or predefined selection of various specialists, cannot be adapted to the changing practical scenario. In this paper, we propose MedRoute, a flexible and dynamic multi‑agent framework that comprises of a collaborative system of specialist LMM agents. Furthermore, we add a General Practitioner with an RL‑trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text and image‑based medical datasets demonstrate improved diagnostic accuracy, outperforming the state‑of‑the‑art baselines. Our work lays a strong foundation for future research. Code and models are available at https://github.com/UCF‑CRCV/MedRoute/.

Authors:Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai
Title: In-Place Test-Time Training
Abstract:
The static ``train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real‑world tasks. Test‑Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In‑Place Test‑Time Training (In‑Place TTT), a framework that seamlessly endows LLMs with Test‑Time Training ability. In‑Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a ``drop‑in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically‑grounded objective explicitly aligned with the Next‑Token‑Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk‑wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in‑place enhancement, it enables a 4B‑parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT‑related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In‑Place TTT as a promising step towards a paradigm of continual learning in LLMs.

Authors:Jean Kaddour
Title: Target Policy Optimization
Abstract:
In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy‑gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emphTarget Policy Optimization (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution q_i \propto p_i^\,\mathrmold \exp(u_i) and fits the policy to it by cross‑entropy. The loss gradient on sampled‑completion logits is p^θ‑ q, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion‑parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.

Authors:Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, Kai Zhang
Title: The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models
Abstract:
Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real‑world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self‑assessment under in‑context learning (ICL) settings that better reflect real‑world application environments, specifically designed to scrutinize the subtle behavioral nuances induced by memory modifications. This probing reveals a pervasive phenomenon of Surface Compliance, where editors achieve high benchmark scores by merely mimicking target outputs without structurally overwriting internal beliefs. Moreover, we find that recursive modifications accumulate representational residues, triggering cognitive instability and permanently diminishing the reversibility of the model's memory state. These insights underscore the risks of current editing paradigms and highlight the pivotal role of robust memory modification in building trustworthy, long‑term sustainable LLM systems. Code is available at https://github.com/XiaojieGu/SA‑MCQ.

Authors:Ioannis Nasios
Title: Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning
Abstract:
Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to support disaster risk reduction. This study proposes a modular, multi‑model framework that fuses Sentinel‑2 optical imagery with Sentinel‑1 Synthetic Aperture Radar (SAR) data, for robust landslide detection. The methodology leverages multi‑encoder vision transformers, where each data modality is processed through separate lightweight pretrained encoders, achieving strong performance in landslide detection. In addition, the integration of multiple models, particularly the combination of neural networks and gradient boosting models (LightGBM and XGBoost), demonstrates the power of ensemble learning to further enhance accuracy and robustness. Derived spectral indices, such as NDVI, are integrated alongside original bands to enhance sensitivity to vegetation and surface changes. The proposed methodology achieves a state‑of‑the‑art F1 score of 0.919 on landslide detection, addressing a patch‑based classification task rather than pixel‑level segmentation and operating without pre‑event Sentinel‑2 data, highlighting its effectiveness in a non‑classical change detection setting. It also demonstrated top performance in a machine learning competition, achieving a strong balance between precision and recall and highlighting the advantages of explicitly leveraging the complementary strengths of optical and radar data. The conducted experiments and research also emphasize scalability and operational applicability, enabling flexible configurations with optical‑only, SAR‑only, or combined inputs, and offering a transferable framework for broader natural hazard monitoring and environmental change applications. Full training and inference code can be found in https://github.com/IoannisNasios/sentinel‑landslide‑cls.

Authors:Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, Yang Deng
Title: Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents
Abstract:
Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision‑making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limited scalability. In this paper, we propose STEP‑HRL, a hierarchical reinforcement learning (HRL) framework that enables step‑level learning by conditioning only on single‑step transitions rather than full interaction histories. STEP‑HRL structures tasks hierarchically, using completed subtasks to represent global progress of overall task. By introducing a local progress module, it also iteratively and selectively summarizes interaction history within each subtask to produce a compact summary of local progress. Together, these components yield augmented step‑level transitions for both high‑level and low‑level policies. Experimental results on ScienceWorld and ALFWorld benchmarks consistently demonstrate that STEP‑HRL substantially outperforms baselines in terms of performance and generalization while reducing token usage. Our code is available at https://github.com/TonyStark042/STEP‑HRL.

Authors:Laurits Fredsgaard, Aaron Thomas, Michael Riis Andersen, Mikkel N. Schmidt, Mahito Sugiyama
Title: Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings
Abstract:
Autoregressive graph generators define likelihoods via a sequential construction process, but these likelihoods are only meaningful if they are consistent across all linearizations of the same graph. Segmented Eulerian Neighborhood Trails (SENT), a recent linearization method, converts graphs into sequences that can be perfectly decoded and efficiently processed by language models, but admit multiple equivalent linearizations of the same graph. We quantify violations in assigned negative log‑likelihood (NLL) using the coefficient of variation across equivalent linearizations, which we call Linearization Uncertainty (LU). Training transformers under four linearization strategies on two datasets, we show that biased orderings achieve lower NLL on their native order but exhibit expected calibration error (ECE) two orders of magnitude higher under random permutation, indicating that these models have learned their training linearization rather than the underlying graph. On the molecular graph benchmark QM9, NLL for generated graphs is negatively correlated with molecular stability (AUC =0.43), while LU achieves AUC =0.85, suggesting that permutation‑based evaluation provides a more reliable quality check for generated molecules. Code is available at https://github.com/lauritsf/linearization‑uncertainty

Authors:Tõnis Lees, Tambet Matiisen
Title: Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game
Abstract:
This work investigates the adaptation of the AlphaZero reinforcement learning algorithm to Tablut, an asymmetric historical board game featuring unequal piece counts and distinct player objectives (king capture versus king escape). While the original AlphaZero architecture successfully leverages a single policy and value head for symmetric games, applying it to asymmetric environments forces the network to learn two conflicting evaluation functions, which can hinder learning efficiency and performance. To address this, the core architecture is modified to use separate policy and value heads for each player role, while maintaining a shared residual trunk to learn common board features. During training, the asymmetric structure introduced training instabilities, notably catastrophic forgetting between the attacker and defender roles. These issues were mitigated by applying C4 data augmentation, increasing the replay buffer size, and having the model play 25 percent of training games against randomly sampled past checkpoints. Over 100 self‑play iterations, the modified model demonstrated steady improvement, achieving a BayesElo rating of 1235 relative to a randomly initialized baseline. Training metrics also showed a significant decrease in policy entropy and average remaining pieces, reflecting increasingly focused and decisive play. Ultimately, the experiments confirm that AlphaZero's self‑play framework can transfer to highly asymmetric games, provided that distinct policy/value heads and robust stabilization techniques are employed.

Authors:Ali Aliev, Kamil Garifullin, Nikolay Yudin, Vera Soboleva, Alexander Molozhavenko, Ivan Oseledets, Aibek Alanov, Maxim Rakhuba
Title: OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
Abstract:
In a rapidly growing field of model training there is a constant practical interest in parameter‑efficient fine‑tuning and various techniques that use a small amount of training data to adapt the model to a narrow task. However, there is an open question: how to combine several adapters tuned for different tasks into one which is able to yield adequate results on both tasks? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that in the case of orthogonal fine‑tuning (OFT), we can use structured orthogonal parametrization and its geometric properties to get the formulas for training‑free adapter merging. In particular, we derive the structure of the manifold formed by the recently proposed Group‑and‑Shuffle (\mathcalGS) orthogonal matrices, and obtain efficient formulas for the geodesics approximation between two points. Additionally, we propose a \textspectra restoration transform that restores spectral properties of the merged adapter for higher‑quality fusion. We conduct experiments in subject‑driven generation tasks showing that our technique to merge two \mathcalGS orthogonal matrices is capable of uniting concept and style features of different adapters. To the best of our knowledge, this is the first training‑free method for merging multiplicative orthogonal adapters. Code is available via the \hrefhttps://github.com/ControlGenAI/OrthoFuselink.

Authors:Jarrid Rector-Brooks, Théophile Lambert, Marta Skreta, Daniel Roth, Yueming Long, Zi-Qi Li, Xi Zhang, Miruna Cretu, Francesca-Zhoufan Li, Tanvi Ganapathy, Emily Jin, Avishek Joey Bose, Jason Yang, Kirill Neklyudov, Yoshua Bengio, Alexander Tong, Frances H. Arnold, Cheng-Hao Liu
Title: General Multimodal Protein Design Enables DNA-Encoding of Chemistry
Abstract:
Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre‑specifying catalytic residues. We introduce DISCO (DIffusion for Sequence‑structure CO‑design), a multimodal model that co‑designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference‑time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active‑site geometries. These enzymes catalyze new‑to‑nature carbene‑transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B‑H, and C(sp^3)‑H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution. By providing a scalable route to evolvable enzymes, DISCO broadens the potential scope of genetically encodable transformations. Code is available at https://github.com/DISCO‑design/DISCO.

Authors:Quyet V. Do, Thinh Pham, Nguyen Nguyen, Sha Li, Pratibha Zunjare, Tu Vu
Title: $π^2$: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models
Abstract:
We study a pipeline that curates reasoning data from initial structured data for improving long‑context reasoning in large language models (LLMs). Our approach, π^2, constructs high‑quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi‑hop analytical reasoning questions whose answers are automatically determined and verified through dual‑path code execution, and 3) back‑translating step‑by‑step structured reasoning traces as solutions of QA pairs given realistic web‑search context. Supervised fine‑tuning with \textsc\smallgpt‑oss‑20b and \textsc\smallQwen3‑4B‑Instruct‑2507 on π^2 yields consistent improvements across four long‑context reasoning benchmarks and our alike π^2‑Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self‑distillation, where \textsc\smallgpt‑oss‑20b even improves its average performance by +4.4% with its own reasoning traces, demonstrating π^2's usefulness. Our code, data, and models are open‑source at https://github.com/vt‑pi‑squared/pi‑squared.

Authors:Ximing Xing, Ziteng Xue, Zhenxi Li, Weicong Liang, Linqing Wang, Zhantao Yang, Tiankai Hang, Zijin Yin, Qinglin Lu, Chunyu Wang, Qian Yu
Title: Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
Abstract:
Recent large language models have shifted SVG generation from differentiable rendering optimization to autoregressive program synthesis. However, existing approaches still rely on generic byte‑level tokenization inherited from natural language processing, which poorly reflects the geometric structure of vector graphics. Numerical coordinates are fragmented into discrete symbols, destroying spatial relationships and introducing severe token redundancy, often leading to coordinate hallucination and inefficient long‑sequence generation. To address these challenges, we propose HiVG, a hierarchical SVG tokenization framework tailored for autoregressive vector graphics generation. HiVG decomposes raw SVG strings into structured atomic tokens and further compresses executable command‑‑parameter groups into geometry‑constrained segment tokens, substantially improving sequence efficiency while preserving syntactic validity. To further mitigate spatial mismatch, we introduce a Hierarchical Mean‑‑Noise (HMN) initialization strategy that injects numerical ordering signals and semantic priors into new token embeddings. Combined with a curriculum training paradigm that progressively increases program complexity, HiVG enables more stable learning of executable SVG programs. Extensive experiments on both text‑to‑SVG and image‑to‑SVG tasks demonstrate improved generation fidelity, spatial consistency, and sequence efficiency compared with conventional tokenization schemes. Our code is publicly available at https://github.com/ximinng/HiVG

Authors:Yasaman Kashefbahrami, Erkut Akdag, Panagiotis Meletis, Evgeniya Balmashnova, Dip Goswami, Egor Bondarau
Title: R3PM-Net: Real-time, Robust, Real-world Point Matching Network
Abstract:
Accurate Point Cloud Registration (PCR) is an important task in 3D data processing, involving the estimation of a rigid transformation between two point clouds. While deep‑learning methods have addressed key limitations of traditional non‑learning approaches, such as sensitivity to noise, outliers, occlusion, and initialization, they are developed and evaluated on clean, dense, synthetic datasets (limiting their generalizability to real‑world industrial scenarios). This paper introduces R3PM‑Net, a lightweight, global‑aware, object‑level point matching network designed to bridge this gap by prioritizing both generalizability and real‑time efficiency. To support this transition, two datasets, Sioux‑Cranfield and Sioux‑Scans, are proposed. They provide an evaluation ground for registering imperfect photogrammetric and event‑camera scans to digital CAD models, and have been made publicly available. Extensive experiments demonstrate that R3PM‑Net achieves competitive accuracy with unmatched speed. On ModelNet40, it reaches a perfect fitness score of 1 and inlier RMSE of 0.029 cm in only 0.007s, approximately 7 times faster than the state‑of‑the‑art method RegTR. This performance carries over to the Sioux‑Cranfield dataset, maintaining a fitness of 1 and inlier RMSE of 0.030 cm with similarly low latency. Furthermore, on the highly challenging Sioux‑Scans dataset, R3PM‑Net successfully resolves edge cases in under 50 ms. These results confirm that R3PM‑Net offers a robust, high‑speed solution for critical industrial applications, where precision and real‑time performance are indispensable. The code and datasets are available at https://github.com/YasiiKB/R3PM‑Net.

Authors:Gowrav Vishwakarma, Christopher J. Agostino
Title: Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
Abstract:
We present Phase‑Associative Memory (PAM), a recurrent sequence model in which all representations are complex‑valued, associations accumulate in a matrix state S_t \in \mathbbC^d × d via outer products, and retrieval operates through the conjugate inner product K_t^ \cdot Q_t / \sqrtd. At ~100M parameters on WikiText‑103, PAM reaches validation perplexity 30.0, within ~10% of a matched transformer (27.1) trained under identical conditions, despite 4× arithmetic overhead from complex computation and no custom kernels. We trace the experimental path from vector‑state models, where holographic binding fails due to the O(1/\sqrtn) capacity degradation of superposed associations, to the matrix state that resolves it. The competitiveness of an architecture whose native operations are complex‑valued superposition and conjugate retrieval is consistent with recent empirical evidence that semantic interpretation in both humans and large language models exhibits non‑classical contextuality, and we discuss what this implies for the choice of computational formalism in language modeling.

Authors:Yiwen Song, Yale Song, Tomas Pfister, Jinsung Yoon
Title: PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing
Abstract:
Synthesizing unstructured research materials into manuscripts is an essential yet under‑explored challenge in AI‑driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines, and produce superficial literature reviews. We introduce PaperOrchestra, a multi‑agent framework for automated AI research paper writing. It flexibly transforms unconstrained pre‑writing materials into submission‑ready LaTeX manuscripts, including comprehensive literature synthesis and generated visuals, such as plots and conceptual diagrams. To evaluate performance, we present PaperWritingBench, the first standardized benchmark of reverse‑engineered raw materials from 200 top‑tier AI conference papers, alongside a comprehensive suite of automated evaluators. In side‑by‑side human evaluations, PaperOrchestra significantly outperforms autonomous baselines, achieving an absolute win rate margin of 50%‑68% in literature review quality, and 14%‑38% in overall manuscript quality.

Authors:Peter Balogh
Title: Darkness Visible: Reading the Exception Handler of a Language Model
Abstract:
The final MLP of GPT‑2 Small exhibits a fully legible routing program ‑‑ 27 named neurons organized into a three‑tier exception handler ‑‑ while the knowledge it routes remains entangled across ~3,040 residual neurons. We decompose all 3,072 neurons (to numerical precision) into: 5 fused Core neurons that reset vocabulary toward function words, 10 Differentiators that suppress wrong candidates, 5 Specialists that detect structural boundaries, and 7 Consensus neurons that each monitor a distinct linguistic dimension. The consensus‑exception crossover ‑‑ where MLP intervention shifts from helpful to harmful ‑‑ is statistically sharp (bootstrap 95% CIs exclude zero at all consensus levels; crossover between 4/7 and 5/7). Three experiments show that "knowledge neurons" (Dai et al., 2022), at L11 of this model, function as routing infrastructure rather than fact storage: the MLP amplifies or suppresses signals already present in the residual stream from attention, scaling with contextual constraint. A garden‑path experiment reveals a reversed garden‑path effect ‑‑ GPT‑2 uses verb subcategorization immediately, consistent with the exception handler operating at token‑level predictability rather than syntactic structure. This architecture crystallizes only at the terminal layer ‑‑ in deeper models, we predict equivalent structure at the final layer, not at layer 11. Code and data: https://github.com/pbalogh/transparent‑gpt2

Authors:Qing Zhou, Bingxuan Zhao, Tao Yang, Hongyuan Zhang, Junyu Gao, Qi Wang
Title: Batch Loss Score for Dynamic Data Pruning
Abstract:
Dynamic data pruning accelerates deep learning by selectively omitting less informative samples during training. While per‑sample loss is a common importance metric, obtaining it can be challenging or infeasible for complex models or loss functions, often requiring significant implementation effort. This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. We frame the batch loss, from the perspective of a single sample, as a noisy measurement of its scaled individual loss, with noise originating from stochastic batch composition. It is formally shown that the EMA mechanism functions as a first‑order low‑pass filter, attenuating high‑frequency batch composition noise. This yields a score approximating the smoothed and persistent contribution of the individual sample to the loss, providing a theoretical grounding for BLS as a proxy for sample importance. BLS demonstrates remarkable code integration simplicity (three‑line injection) and readily adapts existing per‑sample loss‑based methods (one‑line proxy). Its effectiveness is demonstrated by enhancing two such methods to losslessly prune 20%‑50% of samples across 14 datasets, 11 tasks and 18 models, highlighting its utility and broad applicability, especially for complex scenarios where per‑sample loss is difficult to access. Code is available at https://github.com/mrazhou/BLS.

Authors:Fatemeh Khadem, Sajad Mousavi, Yi Fang, Yuhong Liu
Title: DP-OPD: Differentially Private On-Policy Distillation for Language Models
Abstract:
Large language models (LLMs) are increasingly adapted to proprietary and domain‑specific corpora that contain sensitive information, creating a tension between formal privacy guarantees and efficient deployment through model compression. Differential privacy (DP), typically enforced via DP‑SGD, provides record‑level protection but often incurs substantial utility loss in autoregressive generation, where optimization noise can amplify exposure bias and compounding errors along long rollouts. Existing approaches to private distillation either apply DP‑SGD to both teacher and student, worsening computation and the privacy‑‑utility tradeoff, or rely on DP synthetic text generation from a DP‑trained teacher, avoiding DP on the student at the cost of DP‑optimizing a large teacher and introducing an offline generation pipeline. We propose Differentially Private On‑Policy Distillation (DP‑OPD), a synthesis‑free framework that enforces privacy solely through DP‑SGD on the student while leveraging a frozen teacher to provide dense token‑level targets on \emphstudent‑generated trajectories. DP‑OPD instantiates this idea via \emphprivate generalized knowledge distillation on continuation tokens. Under a strict privacy budget (\varepsilon=2.0), DP‑OPD improves perplexity over DP fine‑tuning and off‑policy DP distillation, and outperforms synthesis‑based DP distillation (Yelp: 44.15\rightarrow41.68; BigPatent: 32.43\rightarrow30.63), while substantially simplifying the training pipeline. In particular, DP‑OPD collapses private compression into a single DP student‑training loop by eliminating DP teacher training and offline synthetic text generation. Code will be released upon publication at https://github.com/khademfatemeh/dp_opd.

Authors:Abu Noman Md Sakib, Zhensen Wang, Merjulah Roby, Zijie Zhang
Title: Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition
Abstract:
Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label‑preserving perturbations. We implement this metric using a pre‑trained BERT model on the SST‑2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model's behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS‑XAI‑Stability.

Authors:Seoyoung Park, Haemin Lee, Hankook Lee
Title: Is Prompt Selection Necessary for Task-Free Online Continual Learning?
Abstract:
Task‑free online continual learning has recently emerged as a realistic paradigm for addressing continual learning in dynamic, real‑world environments, where data arrive in a non‑stationary stream without clear task boundaries and can only be observed once. To consider such challenging scenarios, many recent approaches have employed prompt selection, an adaptive strategy that selects prompts from a pool based on input signals. However, we observe that such selection strategies often fail to select appropriate prompts, yielding suboptimal results despite additional training of key parameters. Motivated by this observation, we propose a simple yet effective SinglePrompt that eliminates the need for prompt selection and focuses on classifier optimization. Specifically, we simply (i) inject a single prompt into each self‑attention block, (ii) employ a cosine similarity‑based logit design to alleviate the forgetting effect inherent in the classifier weights, and (iii) mask logits for unexposed classes in the current minibatch. With this simple task‑free design, our framework achieves state‑of‑the‑art performance across various online continual learning benchmarks. Source code is available at https://github.com/efficient‑learning‑lab/SinglePrompt.

Authors:Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Sekitoshi Kanai, Masanori Yamada, Kosuke Nishida, Kazutoshi Shinoda
Title: Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
Abstract:
Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley‑Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non‑preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non‑preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.

Authors:Saurav Jha, Maryam Hashemzadeh, Ali Saheb Pasand, Ali Parviz, Min-Joong Lee, Boris Knyazev
Title: REAM: Merging Improves Pruning of Experts in LLMs
Abstract:
Mixture‑of‑Experts (MoE) large language models (LLMs) are among the top‑performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router‑weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel method, Router‑weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple‑choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade‑off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math and coding data, we examine the Pareto frontier of this trade‑off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.

Authors:Hristo Petkov, Calum MacLellan, Feng Dong
Title: DAGAF: A directed acyclic generative adversarial framework for joint structure learning and tabular data synthesis
Abstract:
Understanding the causal relationships between data variables can provide crucial insights into the construction of tabular datasets. Most existing causality learning methods typically focus on applying a single identifiable causal model, such as the Additive Noise Model (ANM) or the Linear non‑Gaussian Acyclic Model (LiNGAM), to discover the dependencies exhibited in observational data. We improve on this approach by introducing a novel dual‑step framework capable of performing both causal structure learning and tabular data synthesis under multiple causal model assumptions. Our approach uses Directed Acyclic Graphs (DAG) to represent causal relationships among data variables. By applying various functional causal models including ANM, LiNGAM and the Post‑Nonlinear model (PNL), we implicitly learn the contents of DAG to simulate the generative process of observational data, effectively replicating the real data distribution. This is supported by a theoretical analysis to explain the multiple loss terms comprising the objective function of the framework. Experimental results demonstrate that DAGAF outperforms many existing methods in structure learning, achieving significantly lower Structural Hamming Distance (SHD) scores across both real‑world and benchmark datasets (Sachs: 47%, Child: 11%, Hailfinder: 5%, Pathfinder: 7% improvement compared to state‑of‑the‑art), while being able to produce diverse, high‑quality samples.

Authors:Jinhao Pan, Bowen Wei, Ziwei Zhu
Title: A Logical-Rule Autoencoder for Interpretable Recommendations
Abstract:
Most deep learning recommendation models operate as black boxes, relying on latent representations that obscure their decision process. This lack of intrinsic interpretability raises concerns in applications that require transparency and accountability. In this work, we propose a Logical‑rule Interpretable Autoencoder (LIA) for collaborative filtering that is interpretable by design. LIA introduces a learnable logical rule layer in which each rule neuron is equipped with a gate parameter that automatically selects between AND and OR operators during training, enabling the model to discover diverse logical patterns directly from data. To support functional completeness without doubling the input dimensionality, LIA encodes negation through the sign of connection weights, providing a parameter‑efficient mechanism for expressing both positive and negated item conditions within each rule. By learning explicit, human‑readable reconstruction rules, LIA allows users to directly trace the decision process behind each recommendation. Extensive experiments show that our method achieves improved recommendation performance over traditional baselines while remaining fully interpretable. Code and data are available at https://github.com/weibowen555/LIA.

Authors:Yancheng Huang, Changsheng Wang, Chongyu Fan, Yicheng Lang, Bingqi Shang, Yang Zhang, Mingyi Hong, Qing Qu, Alvaro Velasquez, Sijia Liu
Title: Subspace Control: Turning Constrained Model Steering into Controllable Spectral Optimization
Abstract:
Foundation models, such as large language models (LLMs), are powerful but often require customization before deployment to satisfy practical constraints such as safety, privacy, and task‑specific requirements, leading to "constrained" optimization problems for model steering and adaptation. However, solving such problems remains largely underexplored and is particularly challenging due to interference between the primary objective and constraint objectives during optimization. In this paper, we propose a subspace control framework for constrained model training. Specifically, (i) we first analyze, from a model merging perspective, how spectral cross‑task interference arises and show that it can be resolved via a one‑shot solution that orthogonalizes the merged subspace; (ii) we establish a connection between this solution and gradient orthogonalization in the spectral optimizer Muon; and (iii) building on these insights, we introduce SIFT (spectral interference‑free training), which leverages a localization scheme to selectively intervene during optimization, enabling controllable updates that mitigate objective‑constraint conflicts. We evaluate SIFT across four representative applications: (a) machine unlearning, (b) safety alignment, (c) text‑to‑speech adaptation, and (d) hallucination mitigation. Compared to both control‑based and control‑free baselines, SIFT consistently achieves substantial and robust performance improvements across all tasks. Code is available at https://github.com/OPTML‑Group/SIFT.

Authors:Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Title: ClawArena: Benchmarking AI Agents in Evolving Information Environments
Abstract:
AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single‑authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi‑channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi‑source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14‑category question taxonomy. Two question formats, multi‑choice (set‑selection) and shell‑based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self‑evolving skill frameworks can partially close model‑capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming‑lab/ClawArena.

Authors:Milo Coombs
Title: Spectral Path Regression: Directional Chebyshev Harmonics for Interpretable Tabular Learning
Abstract:
Classical approximation bases such as Chebyshev polynomials provide principled and interpretable representations, but their multivariate tensor‑product constructions scale exponentially with dimension and impose axis‑aligned structure that is poorly matched to real tabular data. We address this by replacing tensorised oscillations with directional harmonic modes of the form \cos(\mathbfm^\top\arccos(\mathbfx)), which organise multivariate structure by direction in angular space rather than by coordinate index. This representation yields a discrete spectral regression model in which complexity is controlled by selecting a small number of structured frequency vectors (spectral paths), and training reduces to a single closed‑form ridge solve with no iterative optimisation. Experiments on standard continuous‑feature tabular regression benchmarks show that the resulting models achieve accuracy competitive with strong nonlinear baselines while remaining compact, computationally efficient, and explicitly interpretable through analytic expressions of learned feature interactions.

Authors:Hang Xu, Ling Yue, Chaoqian Ouyang, Libin Zheng, Shaowu Pan, Shimin Di, Min-Ling Zhang
Title: FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
Abstract:
Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM‑based reviewing systems read only the manuscript and generate comments from the paper's own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence‑grounded reviewing system that combines claim extraction, literature positioning, and execution‑based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper's technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper's broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper remains 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision‑maker, but as a tool for gathering evidence and helping reviewers produce more evidence‑grounded assessments. The code is public at https://github.com/DEFENSE‑SEU/Review‑Assistant.

Authors:Nahyuk Lee, Zhiang Chen, Marc Pollefeys, Sunghwan Hong
Title: TORA: Topological Representation Alignment for 3D Shape Assembly
Abstract:
Flow‑matching methods for 3D shape assembly learn point‑wise velocity fields that transport parts toward assembled configurations, yet they receive no explicit guidance about which cross‑part interactions should drive the motion. We introduce TORA, a topology‑first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow‑matching backbone during training. We first realize this via simple instantiation, token‑wise cosine matching, which injects the learned geometric descriptors from the teacher representation. We then extend to employ a Centered Kernel Alignment (CKA) loss to match the similarity structure between student and teacher representations for enhanced topological alignment. Through systematic probing of diverse 3D encoders, we show that geometry‑ and contact‑centric teacher properties, not semantic classification ability, govern alignment effectiveness, and that alignment is most beneficial at later transformer layers where spatial structure naturally emerges. TORA introduces zero inference overhead while yielding two consistent benefits: faster convergence (up to 6.9×) and improved accuracy in‑distribution, along with greater robustness under domain shift. Experiments on five benchmarks spanning geometric, semantic, and inter‑object assembly demonstrate state‑of‑the‑art performance, with particularly pronounced gains in zero‑shot transfer to unseen real‑world and synthetic datasets. Project page: https://nahyuklee.github.io/tora.

Authors:Yifu Ding, Xinhao Zhang, Jinyang Guo
Title: Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
Abstract:
Transformer‑based large language models (LLMs) have demonstrated remarkable performance across a wide range of real‑world tasks, but their inference cost remains prohibitively high due to the quadratic complexity of attention and the memory bandwidth limitations of high‑precision operations. In this work, we present a low‑bit mixed‑precision attention kernel using the microscaling floating‑point (MXFP) data format, utilizing the computing capability on next‑generation GPU architectures. Our Diagonal‑Tiled Mixed‑Precision Attention (DMA) incorporates two kinds of low‑bit computation at the tiling‑level, and is a delicate fused kernel implemented using Triton, exploiting hardware‑level parallelism and memory efficiency to enable fast and efficient inference without compromising model performance. Extensive empirical evaluations on NVIDIA B200 GPUs show that our kernel maintains generation quality with negligible degradation, and meanwhile achieves significant speedup by kernel fusion. We release our code at https://github.com/yifu‑ding/MP‑Sparse‑Attn.

Authors:Indar Kumar, Girish Karhana, Sai Krishna Jasti, Ankit Hemant Lade
Title: Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look
Abstract:
Effective ride‑hailing dispatch requires anticipating demand patterns that vary substantially across time‑of‑day, day‑of‑week, season, and special events. We propose a regime‑calibrated approach that (i) segments historical trip data into demand regimes, (ii) matches the current operating period to the most similar historical analogues via a six‑metric similarity ensemble (Kolmogorov‑Smirnov, Wasserstein‑1, feature distance, variance ratio, event pattern, temporal proximity), and (iii) uses the resulting calibrated demand prior to drive both an LP‑based fleet repositioning policy and batch dispatch with Hungarian matching. In ablation, a distributional‑only subset is strongest on mean wait, while the full ensemble is retained as a robustness‑oriented default. Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios (winter/summer, weekday/weekend/holiday, morning/evening/night) with 5 random seeds each, our method reduces mean rider wait times by 31.1% (bootstrap 95% CI: [26.5, 36.6]%; Friedman chi‑sq = 80.0, p = 4.25e‑18; Cohen's d = 7.5‑29.9 across scenarios). The improvement extends to the tail: P95 wait drops 37.6% and the Gini coefficient of wait times improves from 0.441 to 0.409 (7.3% relative). The two contributions compose multiplicatively and are independently validated: calibration provides 16.9% reduction; LP repositioning adds a further 15.5%. The approach requires no training, is deterministic and explainable, generalizes to Chicago (23.3% wait reduction via NYC‑built regime library), and is robust across fleet sizes (32‑47% improvement for 0.5‑2x fleet scaling). We provide comprehensive ablation studies, formal statistical tests, and routing‑fidelity validation with OSRM.

Authors:Indar Kumar, Akanksha Tiwari
Title: Regime-Calibrated Demand Priors for Ride-Hailing Fleet Dispatch and Repositioning
Abstract:
Effective ride‑hailing dispatch requires anticipating demand patterns that vary substantially across time‑of‑day, day‑of‑week, season, and special events. We propose a regime‑calibrated approach that (i) segments historical trip data into demand regimes, (ii) matches the current operating period to the most similar historical analogues via a similarity ensemble combining Kolmogorov‑Smirnov distance, Wasserstein‑1 distance, feature distance, variance ratio, event pattern similarity, and temporal proximity, and (iii) uses the resulting calibrated demand prior to drive both an LP‑based fleet repositioning policy and batch dispatch with Hungarian matching. In ablation, a distributional‑only metric subset achieves the strongest mean‑wait reduction, while the full ensemble is retained as a robustness‑oriented default that preserves calendar and event context. Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios (winter/summer, weekday/weekend/holiday, morning/evening/night) with 5 random seeds each, our method reduces mean rider wait times by 31.1% (bootstrap 95% CI: [26.5, 36.6]; Friedman chi‑squared = 80.0, p = 4.25e‑18; Cohen's d = 7.5‑29.9). P95 wait drops 37.6% and the Gini coefficient of wait times improves from 0.441 to 0.409. The two contributions compose multiplicatively: calibration provides 16.9% reduction relative to the replay baseline; LP repositioning adds a further 15.5%. The approach requires no training, is deterministic and explainable, generalizes to Chicago (23.3% wait reduction using the NYC‑built regime library without retraining), and is robust across fleet sizes (32‑47% improvement for 0.5x‑2.0x fleet scaling). Code is available at https://github.com/IndarKarhana/regime‑calibrated‑dispatch.

Authors:Haocheng Ju, Guoxiong Gao, Jiedong Jiang, Bin Wu, Zeming Sun, Leheng Chen, Yutong Wang, Yuefeng Wang, Zichen Wang, Wanyi He, Peihao Wu, Liang Xiao, Ruochuan Liu, Bryan Dai, Bin Dong
Title: Automated Conjecture Resolution with Formal Verification
Abstract:
Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elementary problem solving to increasingly capable performance on research‑level problems. However, reliably solving and verifying such problems remains challenging due to the inherent ambiguity of natural language reasoning. In this paper, we propose an automated framework for tackling research‑level mathematical problems that integrates natural language reasoning with formal verification, enabling end‑to‑end problem solving with minimal human intervention. Our framework consists of two components: an informal reasoning agent, Rethlas, and a formal verification agent, Archon. Rethlas mimics the workflow of human mathematicians by combining reasoning primitives with our theorem search engine, Matlas, to explore solution strategies and construct candidate proofs. Archon, equipped with our formal theorem search engine LeanSearch, translates informal arguments into formalized Lean 4 projects through structured task decomposition, iterative refinement, and automated proof synthesis, ensuring machine‑checkable correctness. Using this framework, we automatically resolve an open problem in commutative algebra and formally verify the resulting proof in Lean 4 with essentially no human involvement. Our experiments demonstrate that strong theorem retrieval tools enable the discovery and application of cross‑domain mathematical techniques, while the formal agent is capable of autonomously filling nontrivial gaps in informal arguments. More broadly, our work illustrates a promising paradigm for mathematical research in which informal and formal reasoning systems, equipped with theorem retrieval tools, operate in tandem to produce verifiable results, substantially reduce human effort, and offer a concrete instantiation of human‑AI collaborative mathematical research.

Authors:Kitsuya Azuma, Takayuki Nishio
Title: BlazeFL: Fast and Deterministic Federated Learning Simulation
Abstract:
Federated learning (FL) research increasingly relies on single‑node simulations with hundreds or thousands of virtual clients, making both efficiency and reproducibility essential. Yet parallel client training often introduces nondeterminism through shared random state and scheduling variability, forcing researchers to trade throughput for reproducibility or to implement custom control logic within complex frameworks. We present BlazeFL, a lightweight framework for single‑node FL simulation that alleviates this trade‑off through free‑threaded shared‑memory execution and deterministic randomness management. BlazeFL uses thread‑based parallelism with in‑memory parameter exchange between the server and clients, avoiding serialization and inter‑process communication overhead. To support deterministic execution, BlazeFL assigns isolated random number generator (RNG) streams to clients. Under a fixed software/hardware stack, and when stochastic operators consume BlazeFL‑managed generators, this design yields bitwise‑identical results across repeated high‑concurrency runs in both thread‑based and process‑based modes. In CIFAR‑10 image‑classification experiments, BlazeFL substantially reduces execution time relative to a widely used open‑source baseline, achieving up to 3.1× speedup on communication‑dominated workloads while preserving a lightweight dependency footprint. Our open‑source implementation is available at: https://github.com/kitsuyaazuma/blazefl.

Authors:Viet Dung Nguyen, Yuhang Song, Anh Nguyen, Jamison Heard, Reynold Bailey, Alexander Ororbia
Title: Optimizing Neurorobot Policy under Limited Demonstration Data through Preference Regret
Abstract:
Robot reinforcement learning from demonstrations (RLfD) assumes that expert data is abundant; this is usually unrealistic in the real world given data scarcity as well as high collection cost. Furthermore, imitation learning algorithms assume that the data is independently and identically distributed, which ultimately results in poorer performance as gradual errors emerge and compound within test‑time trajectories. We address these issues by introducing the "master your own expertise" (MYOE) framework, a self‑imitation framework that enables robotic agents to learn complex behaviors from limited demonstration data samples. Inspired by human perception and action, we propose and design what we call the queryable mixture‑of‑preferences state space model (QMoP‑SSM), which estimates the desired goal at every time step. These desired goals are used in computing the "preference regret", which is used to optimize the robot control policy. Our experiments demonstrate the robustness, adaptability, and out‑of‑sample performance of our agent compared to other state‑of‑the‑art RLfD schemes. The GitHub repository that supports this work can be found at: https://github.com/rxng8/neurorobot‑preference‑regret‑learning.

Authors:Maharshi Savdhariya
Title: NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure
Abstract:
BitNet b1.58 (Ma et al., 2024) demonstrates that large language models can operate entirely on ternary weights ‑1, 0, +1, yet no native binary wire format exists for such models. NativeTernary closes this gap. We present NativeTernary, a binary encoding scheme that partitions the 2‑bit pair space into three data symbols representing ternary values ‑‑ either balanced ‑1, 0, +1 or unsigned 0, 1, 2 ‑‑ and a reserved structural delimiter. The central contribution is the use of unary run‑length encoding to represent semantic hierarchy depth: a sequence of N consecutive delimiter pairs denotes a boundary of level N, encoding character, word, sentence, paragraph, and topic boundaries at cost 2, 4, 6, 8, and 10 bits respectively ‑‑ proportional to boundary rarity. The choice of which 2‑bit pair serves as the delimiter is a design parameter: 11 is the primary embodiment, offering simple OR‑gate detection; 00 is an alternative embodiment optimised for ultra‑low‑power CMOS systems, minimising switching activity. All four bit‑pair choices are covered by the patent claims. We present three encoding variants: (1) the primary scheme with 11 as sole delimiter; (2) a dual‑starter variant where both 10 and 11 initiate distinct symbol namespaces; and (3) an analysis of unsigned versus balanced ternary data mappings. We describe a path toward ternary‑native general computing infrastructure requiring no hardware changes, and outline applications spanning ternary neural network weight storage, hierarchical natural language encoding, edge computing, IoT and satellite telemetry, industrial sensors, automotive systems, medical devices, gaming, and financial tick data. The decoder is a 10‑line stateless state machine resilient to bitstream corruption.

Authors:Tomek Kaszyński
Title: Emergent Compositional Communication for Latent World Properties
Abstract:
Can multi‑agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel‑Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near‑perfect compositionality (PosDis=0.999, holdout 98.3%). Controls confirm multi‑agent structure ‑‑ not bandwidth or temporal coverage ‑‑ drives this effect. Causal intervention shows surgical property disruption (~15% drop on targeted property, <3% on others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially‑visible ramp physics (98.3% vs 95.1%), while V‑JEPA 2 dominates on dynamics‑only collision physics (87.4% vs 77.7%, d=2.74). Scale‑matched (d=3.37) and frame‑matched (d=6.53) controls attribute this gap entirely to video‑native pretraining. The frozen protocol supports action‑conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on Physics 101 real camera footage confirms 85.6% mass‑comparison accuracy on unseen objects, temporal dynamics contributing +11.2% beyond static appearance, agent‑scaling compositionality replicating at 90% for 4 agents, and causal intervention extending to real video (d=1.87, p=0.022).

Authors:Carmine Valentino, Federico Pichi, Francesco Colace, Dajana Conte, Gianluigi Rozza
Title: Integrating Artificial Intelligence, Physics, and Internet of Things: A Framework for Cultural Heritage Conservation
Abstract:
The conservation of cultural heritage increasingly relies on integrating technological innovation with domain expertise to ensure effective monitoring and predictive maintenance. This paper presents a novel framework to support the preservation of cultural assets, combining Internet of Things (IoT) and Artificial Intelligence (AI) technologies, enhanced with the physical knowledge of phenomena. The framework is structured into four functional layers that permit the analysis of 3D models of cultural assets and elaborate simulations based on the knowledge acquired from data and physics. A central component of the proposed framework consists of Scientific Machine Learning, particularly Physics‑Informed Neural Networks (PINNs), which incorporate physical laws into deep learning models. To enhance computational efficiency, the framework also integrates Reduced Order Methods (ROMs), specifically Proper Orthogonal Decomposition (POD), and is also compatible with classical Finite Element (FE) methods. Additionally, it includes tools to automatically manage and process 3D digital replicas, enabling their direct use in simulations. The proposed approach offers three main contributions: a methodology for processing 3D models of cultural assets for reliable simulation; the application of PINNs to combine data‑driven and physics‑based approaches in cultural heritage conservation; and the integration of PINNs with ROMs to efficiently model degradation processes influenced by environmental and material parameters. The reproducible and open‑access experimental phase exploits simulated scenarios on complex and real‑life geometries to test the efficacy of the proposed framework in each of its key components, allowing the possibility of dealing with both direct and inverse problems. Code availability: https://github.com/valc89/PhysicsInformedCulturalHeritage

Authors:David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko
Title: Learning the Signature of Memorization in Autoregressive Language Models
Abstract:
All prior membership inference attacks for fine‑tuned language models use hand‑crafted heuristics (e.g., loss thresholding, Min‑K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine‑tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine‑tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer‑based models. It transfers zero‑shot to Mamba (state‑space), RWKV‑4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held‑out transformers (0.908 AUC). These four families share no computational mechanisms, their only commonality is gradient descent on cross‑entropy loss. Even simple likelihood‑based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT‑MIA), captures this signal most effectively by reframing membership inference as sequence classification over per‑token distributional statistics. On transformers, LT‑MIA achieves 2.8× higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains‑Research/learned‑mia.

Authors:Nikita Vassilyev, William Berrios, Ruowang Zhang, Bo Han, Douwe Kiela, Shikib Mehri
Title: Reflective Context Learning: Studying the Optimization Primitives of Context Space
Abstract:
Generally capable agents must learn from experience in ways that generalize across tasks and environments. The fundamental problems of learning, including credit assignment, overfitting, forgetting, local optima, and high‑variance learning signals, persist whether the learned object lies in parameter space or context space. While these challenges are well understood in classical machine learning optimization, they remain underexplored in context space, leading current methods to be fragmented and ad hoc. We present Reflective Context Learning (RCL), a unified framework for agents that learn through repeated interaction, reflection on behavior and failure modes, and iterative updates to context. In RCL, reflection converts trajectories and current context into a directional update signal analogous to gradients, while mutation applies that signal to improve future behavior in context space. We recast recent context‑optimization approaches as instances of this shared learning problem and systematically extend them with classical optimization primitives, including batching, improved credit‑assignment signal, auxiliary losses, failure replay, and grouped rollouts for variance reduction. On AppWorld, BrowseComp+, and RewardBench2, these primitives improve over strong baselines, with their relative importance shifting across task regimes. We further analyze robustness to initialization, the effects of batch size, sampling and curriculum strategy, optimizer‑state variants, and the impact of allocating stronger or weaker models to different optimization components. Our results suggest that learning through context updates should be treated not as a set of isolated algorithms, but as an optimization problem whose mechanisms can be studied systematically and improved through transferable principles.

Authors:Santosh Mohan Rajkumar, Dibyasri Barman, Kumar Vikram Singh, Debdipta Goswami
Title: On Data-Driven Koopman Representations of Nonlinear Delay Differential Equations
Abstract:
This work establishes a rigorous bridge between infinite‑dimensional delay dynamics and finite‑dimensional Koopman learning, with explicit and interpretable error guarantees. While Koopman analysis is well‑developed for ordinary differential equations (ODEs) and partially for partial differential equations (PDEs), its extension to delay differential equations (DDEs) remains limited due to the infinite‑dimensional phase space of DDEs. We propose a finite‑dimensional Koopman approximation framework based on history discretization and a suitable reconstruction operator, enabling a tractable representation of the Koopman operator via kernel‑based extended dynamic mode decomposition (kEDMD). Deterministic error bounds are derived for the learned predictor, decomposing the total error into contributions from history discretization, kernel interpolation, and data‑driven regression. Additionally, we develop a kernel‑based reconstruction method to recover discretized states from lifted Koopman coordinates, with provable guarantees. Numerical results demonstrate convergence of the learned predictor with respect to both discretization resolution and training data, supporting reliable prediction and control of delay systems.

Authors:Md. Rashadul Islam
Title: Explainable Machine Learning Reveals 12-Fold Ucp1 Upregulation and Thermogenic Reprogramming in Female Mouse White Adipose Tissue After 37 Days of Microgravity: First AI/ML Analysis of NASA OSD-970
Abstract:
Microgravity induces profound metabolic adaptations in mammalian physiology, yet the molecular mechanisms governing thermogenesis in female white adipose tissue (WAT) remain poorly characterized. This paper presents the first machine learning (ML) analysis of NASA Open Science Data Repository (OSDR) dataset OSD‑970, derived from the Rodent Research‑1 (RR‑1) mission. Using RT‑qPCR data from 89 adipogenesis and thermogenesis pathway genes in gonadal WAT of 16 female C57BL/6J mice (8 flight, 8 ground control) following 37 days aboard the International Space Station (ISS), we applied differential expression analysis, multiple ML classifiers with Leave‑One‑Out Cross‑Validation (LOO‑CV), and Explainable AI via SHapley Additive exPlanations (SHAP). The most striking finding is a dramatic 12.21‑fold upregulation of Ucp1 (Delta‑Delta‑Ct = ‑3.61, p = 0.0167) in microgravity‑exposed WAT, accompanied by significant activation of the thermogenesis pathway (mean pathway fold‑change = 3.24). The best‑performing model (Random Forest with top‑20 features) achieved AUC = 0.922, Accuracy = 0.812, and F1 = 0.824 via LOO‑CV. SHAP analysis consistently ranked Ucp1 among the top predictive features, while Angpt2, Irs2, Jun, and Klf‑family transcription factors emerged as dominant consensus classifiers. Principal component analysis (PCA) revealed clear separation between flight and ground samples, with PC1 explaining 69.1% of variance. These results suggest rapid thermogenic reprogramming in female WAT as a compensatory response to microgravity. This study demonstrates the power of explainable AI for re‑analysis of newly released NASA space biology datasets, with direct implications for female astronaut health on long‑duration missions and for Earth‑based obesity and metabolic disease research.

Authors:Haseeb Tariq, Marwan Hassani
Title: Extracting Money Laundering Transactions from Quasi-Temporal Graph Representation
Abstract:
Money laundering presents a persistent challenge for financial institutions worldwide, while criminal organizations constantly evolve their tactics to bypass detection systems. Traditional anti‑money laundering approaches mainly rely on predefined risk‑based rules, leading to resource‑intensive investigations and high numbers of false positive alerts. In order to restrict operational costs from exploding, while billions of transactions are being processed every day, financial institutions are investing in more sophisticated mechanisms to improve existing systems. In this paper, we present ExSTraQt (EXtract Suspicious TRAnsactions from Quasi‑Temporal graph representation), an advanced supervised learning approach to detect money laundering (or suspicious) transactions in financial datasets. Our proposed framework excels in performance, when compared to the state‑of‑the‑art AML (Anti Money Laundering) detection models. The key strengths of our framework are sheer simplicity, in terms of design and number of parameters; and scalability, in terms of the computing and memory requirements. We evaluated our framework on transaction‑level detection accuracy using a real dataset; and a set of synthetic financial transaction datasets. We consistently achieve an uplift in the F1 score for most datasets, up to 1% for the real dataset; and more than 8% for one of the synthetic datasets. We also claim that our framework could seamlessly complement existing AML detection systems in banks. Our code and datasets are available at https://github.com/mhaseebtariq/exstraqt.

Authors:Giyeong Oh, Junghyun Lee, Jaehyun Park, Youngjae Yu, Wonho Bae, Junhyug Noh
Title: Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
Abstract:
Modern LLMs inherit strong priors from web‑scale pretraining, which can limit the headroom of post‑training data‑selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on‑policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty‑based APL against Random across harmlessness, helpfulness, and instruction‑following settings, utilizing both reward models and LLM‑as‑a‑judge proxies. We find that APL yields negligible improvements in proxy win‑rates compared to Random. Crucially, we observe a dissociation where win‑rate improves even as general capability ‑‑ measured by standard benchmarks ‑‑ degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre‑trained priors, the computational overhead of active selection is difficult to justify against the ``cheap diversity'' provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random‑vs‑apl.

Authors:Mirali Purohit, Bimal Gajera, Irish Mehta, Bhanu Tokas, Jacob Adler, Steven Lu, Scott Dickenshied, Serina Diniega, Brian Bue, Umaa Rebbapragada, Hannah Kerner
Title: MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications
Abstract:
We introduce MOMO, the first multi‑sensor foundation model for Mars remote sensing. MOMO uses model merge to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large‑scale, high‑quality corpus of ~ 12 million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars‑Bench. MOMO achieves better overall performance compared to ImageNet pre‑trained, earth observation foundation model, sensor‑specific pre‑training, and fully‑supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi‑resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: https://github.com/kerner‑lab/MOMO.

Authors:Patrick Pynadath, Jiaxin Shi, Ruqi Zhang
Title: Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Abstract:
Diffusion language models have seen exciting recent progress, offering far more flexibility in generative trajectories than autoregressive models. This flexibility has motivated a growing body of research into new approaches to diffusion language modeling, which typically begins at the scale of GPT‑2 small (150 million parameters). However, these advances introduce new issues with evaluation methodology. In this technical note, we discuss the limitations of current methodology and propose principled augmentations to ensure reliable comparisons. We first discuss why OpenWebText has become the standard benchmark, and why alternatives such as LM1B are inherently less meaningful. We then discuss the limitations of likelihood evaluations for diffusion models, and explain why relying on generative perplexity alone as a metric can lead to uninformative results. To address this, we show that generative perplexity and entropy are two components of the KL divergence to a reference distribution. This decomposition explains generative perplexity's sensitivity to entropy, and naturally suggests generative frontiers as a principled method for evaluating model generative quality. We conclude with empirical observations on model quality at this scale. We include a blog post with interactive content to illustrate the argument at https://patrickpynadath1.github.io/blog/eval_methodology/.

Authors:Yasushi Nishida
Title: AXELRAM: Quantize Once, Never Dequantize
Abstract:
We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design‑time fixed codebook: orthogonal‑transform‑based quantization concentrates each coordinate's distribution to N(0,1/d), so the optimal quantizer depends only on dimension d and bit‑width b, not on input data. The asymmetric path design ‑‑ transform on write, table‑lookup on read with no inverse transform ‑‑ reduces per‑query multiplications by 102.4x (a mathematical identity). Through multi‑seed evaluation (10 seeds x 3 models), we discover that sign pattern sensitivity causes catastrophic PPL spikes (Delta > 50) on certain models (Qwen2.5‑3B), while others (LLaMA‑3.1‑8B) are fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer‑wise norm heterogeneity and propose a gradient‑free sign pattern selection (200 candidates, 8 calibration samples, one‑time) that eliminates catastrophic spikes with zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.

Authors:Gonzalo Uribarri
Title: ROMAN: A Multiscale Routing Operator for Convolutional Time Series Models
Abstract:
We introduce ROMAN (ROuting Multiscale representAtioN), a deterministic operator for time series that maps temporal scale and coarse temporal position into an explicit channel structure while reducing sequence length. ROMAN builds an anti‑aliased multiscale pyramid, extracts fixed‑length windows from each scale, and stacks them as pseudochannels, yielding a compact representation on which standard convolutional classifiers can operate. In this way, ROMAN provides a simple mechanism to control the inductive bias of downstream models: it can reduce temporal invariance, make temporal pooling implicitly coarse‑position‑aware, and expose multiscale interactions through channel mixing, while often improving computational efficiency by shortening the processed time axis. We formally analyze the ROMAN operator and then evaluate it in two complementary ways by measuring its impact as a preprocessing step for four representative convolutional classifiers: MiniRocket, MultiRocket, a standard CNN‑based classifier, and a fully convolutional network (FCN) classifier. First, we design synthetic time series classification tasks that isolate coarse position awareness, long‑range correlation, multiscale interaction, and full positional invariance, showing that ROMAN behaves consistently with its intended mechanism and is most useful when class information depends on temporal structure that standard pooled convolution tends to suppress. Second, we benchmark the same models with and without ROMAN on long‑sequence subsets of the UCR and UEA archives, showing that ROMAN provides a practically useful alternative representation whose effect on accuracy is task‑dependent, but whose effect on efficiency is often favorable. Code is available at https://github.com/gon‑uri/ROMAN

Authors:Haiyu Wang, Yutong Wang, Jack Jiang, Sai Qian Zhang
Title: WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models
Abstract:
Singular Value Decomposition (SVD) has become an important technique for reducing the computational burden of Vision Language Models (VLMs), which play a central role in tasks such as image captioning and visual question answering. Although multiple prior works have proposed efficient SVD variants to enable low‑rank operations, we find that in practice it remains difficult to achieve substantial latency reduction during model execution. To address this limitation, we introduce a new computational pattern and apply SVD at a finer granularity, enabling real and measurable improvements in execution latency. Furthermore, recognizing that weight elements differ in their relative importance, we adaptively allocate relative importance to each element during SVD process to better preserve accuracy, then extend this framework with quantization applied to both weights and activations, resulting in a highly efficient VLM. Collectively, we introduce~Weighted SVD (WSVD), which outperforms other approaches by achieving over 1.8× decoding speedup while preserving accuracy. We open source our code at: \hrefhttps://github.com/SAI‑Lab‑NYU/WSVD\texttthttps://github.com/SAI‑Lab‑NYU/WSVD

Authors:Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk
Title: Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
Abstract:
Pretraining 3D encoders by aligning with Contrastive Language Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer‑based encoder that learns unified scene representations from multi‑view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross‑view geometric alignment and grounded view alignment to enforce cross‑view geometry and semantic consistency. Extensive low‑shot and task‑specific fine‑tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate our state‑of‑the‑art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. https://yebulabula.github.io/UniScene3D/

Authors:Malik Hassanaly, Corey R. Randall, Peter J. Weddle, Paul J. Gasper, Conlain Kelly, Tanvir R. Tanim, Kandler Smith
Title: Neural posterior estimation for scalable and accurate inverse parameter inference in Li-ion batteries
Abstract:
Diagnosing the internal state of Li‑ion batteries is critical for battery research, operation of real‑world systems, and prognostic evaluation of remaining lifetime. By using physics‑based models to perform probabilistic parameter estimation via Bayesian calibration, diagnostics can account for the uncertainty due to model fitness, data noise, and the observability of any given parameter. However, Bayesian calibration in Li‑ion batteries using electrochemical data is computationally intensive even when using a fast surrogate in place of physics‑based models, requiring many thousands of model evaluations. A fully amortized alternative is neural posterior estimation (NPE). NPE shifts the computational burden from the parameter estimation step to data generation and model training, reducing the parameter estimation time from minutes to milliseconds, enabling real‑time applications. The present work shows that NPE calibrates parameters equally or more accurately than Bayesian calibration, and we demonstrate that the higher computational costs for data generation are tractable even in high‑dimensional cases (ranging from 6 to 27 estimated parameters), but the NPE method can lead to higher voltage prediction errors. The NPE method also offers several interpretability advantages over Bayesian calibration, such as local parameter sensitivity to specific regions of the voltage curve. The NPE method is demonstrated using an experimental fast charge dataset, with parameter estimates validated against measurements of loss of lithium inventory and loss of active material. The implementation is made available in a companion repository (https://github.com/NatLabRockies/BatFIT).

Authors:Pangpang Liu, Chengchun Shi, Will Wei Sun
Title: Reinforcement Learning from Human Feedback: A Statistical Perspective
Abstract:
Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it relies on noisy, subjective, and often heterogeneous feedback to learn reward models and optimize policies. This survey provides a statistical perspective on RLHF, focusing primarily on the LLM alignment setting. We introduce the main components of RLHF, including supervised fine‑tuning, reward modeling, and policy optimization, and relate them to familiar statistical ideas such as Bradley‑Terry‑Luce (BTL) model, latent utility estimation, active learning, experimental design, and uncertainty quantification. We review methods for learning reward functions from pairwise preference data and for optimizing policies through both two‑stage RLHF pipelines and emerging one‑stage approaches such as direct preference optimization. We further discuss recent extensions including reinforcement learning from AI feedback, inference‑time algorithms, and reinforcement learning from verifiable rewards, as well as benchmark datasets, evaluation protocols, and open‑source frameworks that support RLHF research. We conclude by highlighting open challenges in RLHF. An accompanying GitHub demo https://github.com/Pangpang‑Liu/RLHF_demo illustrates key components of the RLHF pipeline.

Authors:Chin-Chia Michael Yeh
Title: Matrix Profile for Time-Series Anomaly Detection: A Reproducible Open-Source Benchmark on TSB-AD
Abstract:
Matrix Profile (MP) methods are an interpretable and scalable family of distance‑based methods for time‑series anomaly detection, but strong benchmark performance still depends on design choices beyond a vanilla nearest‑neighbor profile. This technical report documents an open‑source Matrix Profile for Anomaly Detection (MMPAD) submission to TSB‑AD, a benchmark that covers both univariate and multivariate time series. The submitted system combines pre‑sorted multidimensional aggregation, efficient exclusion‑zone‑aware k‑nearest‑neighbor (kNN) retrieval for repeated anomalies, and moving‑average post‑processing. To serve as a reproducible reference for MP‑based anomaly detection on TSB‑AD, we detail the released implementation, the hyperparameter settings for the univariate and multivariate tracks, and the corresponding benchmark results. We further analyze how the system performs on the aggregate leaderboard and across specific dataset characteristics.The open‑source implementation is available at https://github.com/mcyeh/mmpad_tsb.

Authors:Mostapha Benhenda
Title: YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches
Abstract:
Forecasting startup success is notoriously difficult, partly because meaningful outcomes, such as exits, large funding rounds, and sustained revenue growth, are rare and can take years to materialize. As a result, signals are sparse and evaluation cycles are slow. Y Combinator batches offer a unique mitigation: each batch comprises around 200 startups, funded simultaneously, with evaluation at Demo Day only three months later. We introduce YC Bench, a live benchmark for forecasting early outperformance within YC batches. Using the YC W26 batch as a case study (196 startups), we measure outperformance with a Pre‑Demo Day Score, a KPI combining publicly available traction signals and web visibility. This short‑term metric enables rapid evaluation of forecasting models. As a baseline, we take Google mentions prior to the YC W26 application deadline, a simple proxy for prior brand recognition, recovering 6 of 11 top performers at YC Demo Day (55% recall). YC Bench provides a live benchmark for studying startup success forecasting, with iteration cycles measured in months rather than years. Code and Data are available on GitHub: https://github.com/benstaf/ycbench

Authors:Yonas Kassa, James Bonacci, Ping Wang
Title: Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations
Abstract:
The transformative potential of large language models (LLMs) in education, such as improving accessibility and personalized learning, is being eclipsed by significant challenges. These challenges stem from concerns that LLMs undermine academic assessment by enabling bypassing of critical thinking, leading to increased cognitive offloading. This emerging trend stresses the dual imperative of harnessing AI's educational benefits while safeguarding critical thinking and academic rigor in the evolving AI ecosystem. To this end, we introduce AI‑Sinkhole, an AI‑agent augmented DNS‑based framework that dynamically discovers, semantically classifies, and temporarily network‑wide blocks emerging LLM chatbot services during proctored exams. AI‑Sinkhole offers explainable classification via quantized LLMs (LLama 3, DeepSeek‑R1, Qwen‑3) and dynamic DNS blocking with Pi‑Hole. We also share our observations in using LLMs as explainable classifiers which achieved robust cross‑lingual performance (F1‑score > 0.83). To support future research and development in this domain initial codes with a readily deployable 'AI‑Sinkhole' blockist is available on https://github.com/AIMLEdu/ai‑sinkhole.

Authors:Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić, Jan-Willem van de Meent
Title: Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
Abstract:
Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high‑dimensional one‑hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum‑image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state‑of‑the‑art results on crystal structure prediction benchmarks, and de novo generation performance, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry‑heavy alternatives.

Authors:Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Title: SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Abstract:
Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference‑time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero‑shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in‑context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training‑time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching he model tool invocation and multi‑turn task completion. A Dynamic Curriculum then evaluates each skill file's on‑policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero‑shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search‑QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU‑REAL/SkillZero.

Authors:Hao Zhu, Di Zhou, Donna Slonim
Title: Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives
Abstract:
Understanding causal dependencies in observational data is critical for informing decision‑making. These relationships are often modeled as Bayesian Networks (BNs) and Directed Acyclic Graphs (DAGs). Existing methods, such as NOTEARS and DAG‑GNN, often face issues with scalability and stability in high‑dimensional data, especially when there is a feature‑sample imbalance. Here, we show that the denoising score matching objective of diffusion models could smooth the gradients for faster, more stable convergence. We also propose an adaptive k‑hop acyclicity constraint that improves runtime over existing solutions that require matrix inversion. We name this framework Denoising Diffusion Causal Discovery (DDCD). Unlike generative diffusion models, DDCD utilizes the reverse denoising process to infer a parameterized causal structure rather than to generate data. We demonstrate the competitive performance of DDCDs on synthetic benchmarking data. We also show that our methods are practically useful by conducting qualitative analyses on two real‑world examples. Code is available at this url: https://github.com/haozhu233/ddcd.

Authors:Xuanfeng Zhou
Title: Universal Hypernetworks for Arbitrary Models
Abstract:
Conventional hypernetworks are typically engineered around a specific base‑model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the \emphUniversal Hypernetwork (UHN), a fixed‑architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor‑based formulation decouples the generator architecture from target‑network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula‑regression benchmarks; (2) the same UHN supports both multi‑model generalization within a family and multi‑task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng‑Zhou/UHN.

Authors:Jeremy Herbst, Jae Hee Lee, Stefan Wermter
Title: The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Abstract:
Mixture‑of‑Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed‑forward networks (FFNs). We compare MoE experts and dense FFNs using k‑sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token‑level processors. Instead, they function as fine‑grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large‑scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis

Authors:Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi
Title: Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection
Abstract:
Human‑Object Interaction (HOI) detection aims to localize human‑object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision‑Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance‑centric Context Mining Network (InCoM‑Net)‑a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance‑specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM‑Net comprises two core components: Instancecentric Context Refinement (ICR), which separately extracts intra‑instance, inter‑instance, and global contextual cues from VLM‑derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multicontext features with instance‑level detector features to support high‑level HOI reasoning. Extensive experiments on the HICO‑DET and V‑COCO benchmarks show that InCoM‑Net achieves state‑of‑the‑art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM‑Net.

Authors:Jaber Jaber, Osama Jaber
Title: Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
Abstract:
Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per‑step diagonal modulation vector, and applies it to frozen SVD‑initialized LoRA bases, making each recurrence step input‑dependent. We combine this with gated recurrence (bias‑initialized to 88% retention) and per‑step LayerNorm for stable deep iteration. On Qwen2.5‑3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17‑layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per‑step norms) yet outperforms equivalently‑sized static per‑step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held‑out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow‑AI/ouroboros

Authors:Sebastian-Ion Nae, Radu Moldoveanu, Alexandra Stefania Ghita, Adina Magda Florea
Title: IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Abstract:
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human‑robot interaction, yet existing datasets rarely capture real‑world indoor complexity at scale. We introduce IndoorCrowd, a multi‑scene dataset for indoor human detection, instance segmentation, and multi‑object tracking, collected across four campus locations (ACS‑EC, ACS‑EG, IE‑Central, R‑Central). It comprises 31 videos (9,913 frames at 5fps) with human‑verified, per‑instance segmentation masks. A 620‑frame control subset benchmarks three foundation‑model auto‑annotators: SAM3, GroundingSAM, and EfficientGroundingSAM, against human labels using Cohen's κ, AP, precision, recall, and mask IoU. A further 2,552‑frame subset supports multi‑object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT‑DETR‑L paired with ByteTrack, BoT‑SORT, and OC‑SORT. Per‑scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS‑EC, with 79.3% dense frames and a mean instance scale of 60.8px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.

Authors:Ilan Gold, Felix Fischer, Lucas Arnoldt, F. Alexander Wolf, Fabian J. Theis
Title: annbatch unlocks terabyte-scale training of biological data in anndata
Abstract:
The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine‑learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini‑batch loader native to anndata that enables out‑of‑core training directly on disk‑backed datasets. Across single‑cell transcriptomics, microscopy and whole‑genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data‑loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. Github: https://github.com/scverse/annbatch

Authors:Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji
Title: Woosh: A Sound Effects Foundation Model
Abstract:
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high‑quality audio encoder/decoder model and (2) a text‑audio alignment model for conditioning, together with (3) text‑to‑audio and (4) video‑to‑audio generative models. Distilled text‑to‑audio and video‑to‑audio models are also included in the release, allowing for low‑resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio‑Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.

Authors:Shuibai Zhang, Caspian Zhuang, Chihan Cui, Zhihan Yang, Fred Zhangzhi Peng, Yanxin Zhang, Haoyue Bai, Zack Jia, Yang Zhou, Guanhua Chen, Ming Liu
Title: Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
Abstract:
Diffusion language models (DLMs) enable parallel, non‑autoregressive text generation, yet existing DLM mixture‑of‑experts (MoE) models inherit token‑choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert‑choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep‑dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low‑mask‑ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low‑mask‑ratio contexts exhibit an order‑of‑magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at https://github.com/zhangshuibai/EC‑DLM.

Authors:Lincan Li, Rikuto Kotoge, Xihao Piao, Zheng Chen, Yushun Dong
Title: Optimizing EEG Graph Structure for Seizure Detection: An Information Bottleneck and Self-Supervised Learning Approach
Abstract:
Seizure detection from EEG signals is highly challenging due to complex spatiotemporal dynamics and extreme inter‑patient variability. To model them, recent methods construct dynamic graphs via statistical correlations, predefined similarity measures, or implicit learning, yet rarely account for EEG's noisy nature. Consequently, these graphs usually contain redundant or task‑irrelevant connections, undermining model performance even with state‑of‑the‑art architectures. In this paper, we present a new perspective for EEG seizure detection: jointly learning denoised dynamic graph structures and informative spatial‑temporal representations guided by the Information Bottleneck (IB). Unlike prior approaches, our graph constructor explicitly accounts for the noisy characteristics of EEG data, producing compact and reliable connectivity patterns that better support downstream seizure detection. To further enhance representation learning, we employ a self‑supervised Graph Masked AutoEncoder that reconstructs masked EEG signals based on dynamic graph context, promoting structure‑aware and compact representations aligned with the IB principle. Bringing things together, we introduce Information Bottleneck‑guided EEG SeizuRE DetectioN via SElf‑Supervised Learning (IRENE), which explicitly learns dynamic graph structures and interpretable spatial‑temporal EEG representations. IRENE addresses three core challenges: (i) Identifying the most informative nodes and edges; (ii) Explaining seizure propagation in the brain network; and (iii) Enhancing robustness against label scarcity and inter‑patient variability. Extensive experiments on benchmark EEG datasets demonstrate that our method outperforms state‑of‑the‑art baselines in seizure detection and provides clinically meaningful insights into seizure dynamics. The source code is available at https://github.com/LabRAI/IRENE.

Authors:Yiming Fan, Jun Yeon Won, Ding Zhu, Melih Sirlanci, Mahdi Khalili, Carter Yagemann
Title: EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
Abstract:
Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. In the past few decades, numerous models and tools have been developed for this application; however, due to the lack of a comprehensive universal benchmark in this field, researchers have struggled to compare different models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or types of binaries, and fail to reflect the full diversity of real‑world applications. We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate 9 representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on firmware and semantic datasets compared to standard settings, revealing substantial generalization gaps. Our results show that robustness to low‑ and mid‑level binary variations does not generalize to high‑level semantic differences, underscoring a critical blind spot in current BFSD evaluation practices.

Authors:Yixiao Wang, Ting Jiang, Zishan Shao, Hancheng Ye, Jingwei Sun, Mingyuan Ma, Jianyi Zhang, Yiran Chen, Hai Li
Title: ZEUS: Accelerating Diffusion Models with Only Second-Order Predictor
Abstract:
Denoising generative models deliver high‑fidelity generation but remain bottlenecked by inference latency due to the many iterative denoiser calls required during sampling. Training‑free acceleration methods reduce latency by either sparsifying the model architecture or shortening the sampling trajectory. Current training‑free acceleration methods are more complex than necessary: higher‑order predictors amplify error under aggressive speedups, and architectural modifications hinder deployment. Beyond 2x acceleration, step skipping creates structural scarcity ‑‑ at most one fresh evaluation per local window ‑‑ leaving the computed output and its backward difference as the only causally grounded information. Based on this, we propose ZEUS, an acceleration method that predicts reduced denoiser evaluations using a second‑order predictor, and stabilizes aggressive consecutive skipping with an interleaved scheme that avoids back‑to‑back extrapolations. ZEUS adds essentially zero overhead, no feature caches, and no architectural modifications, and it is compatible with different backbones, prediction objectives, and solver choices. Across image and video generation, ZEUS consistently improves the speed‑fidelity performance over recent training‑free baselines, achieving up to 3.2x end‑to‑end speedup while maintaining perceptual quality. Our code is available at: https://github.com/Ting‑Justin‑Jiang/ZEUS.

Authors:William Hoy, Binxu Wang, Xu Pan
Title: Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training
Abstract:
Evolution Strategies (ES) have emerged as a scalable gradient‑free alternative to reinforcement learning based LLM fine‑tuning, but it remains unclear whether comparable task performance implies comparable solutions in parameter space. We compare ES and Group Relative Policy Optimization (GRPO) across four tasks in both single‑task and sequential continual‑learning settings. ES matches or exceeds GRPO in single‑task accuracy and remains competitive sequentially when its iteration budget is controlled. Despite this similarity in task performance, the two methods produce markedly different model updates: ES makes much larger changes and induces broader off‑task KL drift, whereas GRPO makes smaller, more localized updates. Strikingly, the ES and GRPO solutions are linearly connected with no loss barrier, even though their update directions are nearly orthogonal. We develop an analytical theory of ES that explains all these phenomena within a unified framework, showing how ES can accumulate large off‑task movement on weakly informative directions while still making enough progress on the task to match gradient‑based RL in downstream accuracy. These results show that gradient‑free and gradient‑based fine‑tuning can reach similarly accurate yet geometrically distinct solutions, with important consequences for forgetting and knowledge preservation. The source code is publicly available: https://github.com/Bhoy1/ESvsGRPO.

Authors:Neo Christopher Chung, Maxim Laletin
Title: Regularizing Attention Scores with Bootstrapping
Abstract:
Vision transformers (ViT) rely on attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for its decision‑making process. However, attention scores are almost always non‑zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non‑zero scores. Leveraging statistical learning techniques, we introduce the bootstrapping for attention scores which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed \emphAttention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real‑world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization

Authors:Haseeb Tariq, Alen Kaja, Marwan Hassani
Title: Detecting Complex Money Laundering Patterns with Incremental and Distributed Graph Modeling
Abstract:
Money launderers take advantage of limitations in existing detection approaches by hiding their financial footprints in a deceitful manner. They manage this by replicating transaction patterns that the monitoring systems cannot easily distinguish. As a result, criminally gained assets are pushed into legitimate financial channels without drawing attention. Algorithms developed to monitor money flows often struggle with scale and complexity. The difficulty of identifying such activities is further intensified by the (persistent) inability of current solutions to control the excessive number of false positive signals produced by rigid, risk‑based rules systems. We propose a framework called ReDiRect (REduce, DIstribute, and RECTify), specifically designed to overcome these challenges. The primary contribution of our work is a novel framing of this problem in an unsupervised setting; where a large transaction graph is fuzzily partitioned into smaller, manageable components to enable fast processing in a distributed manner. In addition, we define a refined evaluation metric that better captures the effectiveness of exposed money laundering patterns. Through comprehensive experimentation, we demonstrate that our framework achieves superior performance compared to existing and state‑of‑the‑art techniques, particularly in terms of efficiency and real‑world applicability. For validation, we used the real (open source) Libra dataset and the recently released synthetic datasets by IBM Watson. Our code and datasets are available at https://github.com/mhaseebtariq/redirect.

Authors:Nathan Benjamin, A. Liam Fitzpatrick, Wei Li, Jesse Thaler
Title: Descending into the Modular Bootstrap
Abstract:
In this paper, we attempt to explore the landscape of two‑dimensional conformal field theories (2d CFTs) by efficiently searching for numerical solutions to the modular bootstrap equation using machine‑learning‑style optimization. The torus partition function of a 2d CFT is fixed by the spectrum of its primary operators and its chiral algebra, which we take to be the Virasoro algebra with c>1. We translate the requirement that this partition function is modular invariant into a loss function, which we then minimize to identify possible primary spectra. Our approach involves two technical innovations that facilitate finding reliable candidate CFTs. The first is a strategy to estimate the uncertainty associated with truncating the spectrum to the lowest dimension operators. The second is the use of a new singular‑value‑based optimizer (Sven) that is more effective than gradient descent at navigating the hierarchical structure of the loss landscape. We numerically construct candidate truncated CFT partition functions with central charges between 1 and \frac87, a range devoid of known examples, and argue that these candidates likely come from a continuous space of modular bootstrap solutions. We also provide evidence for a more stringent constraint on the spectral gap near c = 1 than the existing bound of Δ_\rm gap \le \fracc6 + \frac13.

Authors:Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates
Title: Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning
Abstract:
While test‑time scaling has enabled large language models to solve highly difficult tasks, state‑of‑the‑art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post‑trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test‑time training. Specifically, we introduce a meta‑learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level δ=0.1, ORCA improves Qwen2.5‑32B efficiency on in‑distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self‑consistency labels. Under zero‑shot out‑of‑domain settings, it improves MATH‑500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.

Authors:Jack Young
Title: S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Abstract:
Using roughly 48 execution‑verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5‑4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/‑ 1.7 pp (10 seeds). On FalconH1‑7B (Mamba‑2 hybrid), S0 reaches 71.8% +/‑ 1.3 and LoRA reaches 71.4% +/‑ 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross‑domain transfer is significant on MATH‑500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text‑to‑SQL benchmark (Spider) shows no transfer, consistent with the trajectory‑steering mechanism. A prefix‑tuning control on a pure Transformer (Qwen2.5‑3B) degrades performance by ‑13.9 pp under all nine configurations tested. On Qwen3.5, a per‑step state‑offset variant reaches +27.1 pp, above both S0 and LoRA but with per‑step inference cost. Taken together, the results show that recurrent state initialization is a strong zero‑inference‑overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0‑tuning.

Authors:Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt
Title: Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
Abstract:
As LLM agents are increasingly deployed in multi‑agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single‑agent settings, collusion is inherently a multi‑agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per‑agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in‑distribution and 0.60‑‑0.86 AUROC when transferred zero‑shot to structurally different multi‑agent scenarios and a steganographic blackjack card‑counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi‑agent interpretability: extending white‑box inspection from single models to multi‑agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text‑level monitoring for detecting multi‑agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.

Authors:Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa
Title: Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Abstract:
This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI‑driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI‑written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI‑written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite‑Bench, a benchmark of 51 papers from top‑tier venues across diverse domains published after 2025. Our experiments reveal a clear trade‑off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI‑driven paper writing and improving the understanding of its risks within the research community.

Authors:Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Di Wen, Danda Pani Paudel, Luc Van Gool, Kailun Yang
Title: ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
Abstract:
3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long‑tailed class bias and out‑of‑distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug‑and‑play method that couples prototype‑guided refinement with training‑free OOD scoring. ProOOD comprises (i) prototype‑guided semantic imputation that fills occluded regions with class‑consistent features, (ii) prototype‑guided tail mining that strengthens rare‑class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel‑level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state‑of‑the‑art performance on both in‑distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail‑class mIoU; on VAA‑KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety‑critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.

Authors:Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang
Title: Do Phone-Use Agents Respect Your Privacy?
Abstract:
We study whether phone‑use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy‑compliant behavior is not operationalized for phone‑use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy‑respecting phone use as permissioned access, minimal disclosure, and user‑controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule‑based auditing that make unnecessary permission requests, deceptive re‑disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy‑compliant task completion, and later‑session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over‑helpful execution of benign tasks, and that success‑only evaluation overestimates the deployment readiness of current phone‑use agents. All code, mock apps, and agent trajectories are publicly available at~ https://github.com/FreedomIntelligence/MyPhoneBench.

Authors:Yilun Liu, Jinru Han, Sikuan Yan, Volker Tresp, Yunpu Ma
Title: Routing-Free Mixture-of-Experts
Abstract:
Standard Mixture‑of‑Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing‑Free MoE which eliminates any hard‑coded centralized designs including external routers, Softmax, Top‑K and load balancing, instead encapsulating all activation functionalities within individual experts and directly optimized through continuous gradient flow, enabling each expert to determine its activation entirely on its own. We introduce a unified adaptive load‑balancing framework to simultaneously optimize both expert‑balancing and token‑balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing‑Free MoE can consistently outperform baselines with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design ad optimization.

Authors:Björn Roman Kohlberger
Title: Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction
Abstract:
The memory wall remains the primary bottleneck for training large language models on consumer hardware. We introduce Spectral Compact Training (SCT), a method that replaces dense weight matrices with permanent truncated SVD factors W = U diag(s) V^T, where the full dense matrix is never materialized during training or inference. Gradients flow through the compact spectral factors via standard backpropagation, and U, V are retracted to the Stiefel manifold via QR decomposition after each optimizer step. SCT achieves up to 199x memory reduction per MLP layer at rank 32, enabling full training steps of 70B‑parameter architectures on a Steam Deck handheld (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam). Rank‑sweep experiments on SmolLM2‑1.7B (ranks 32‑256, 2000 steps, NVIDIA A100) show that all tested ranks converge to the same loss floor (~4.2‑4.5), identifying the learning rate schedule ‑‑ not MLP rank ‑‑ as the primary bottleneck. Rank 128 emerges as the efficiency sweet spot at 11.7x MLP compression with the lowest perplexity. GPU memory drops 46% at rank 32 while training throughput doubles.

Authors:Rajkiran Panuganti
Title: CircuitProbe: Predicting Reasoning Circuits in Transformers via Stability Zone Detection
Abstract:
Transformer language models contain localized reasoning circuits, contiguous layer blocks that improve reasoning when duplicated at inference time. Finding these circuits currently requires brute‑force sweeps costing 25 GPU hours per model. We propose CircuitProbe, which predicts circuit locations from activation statistics in under 5 minutes on CPU, providing a speedup of three to four orders of magnitude. We find that reasoning circuits come in two types: stability circuits in early layers, detected through the derivative of representation change, and magnitude circuits in late layers, detected through anomaly scoring. We validate across 9 models spanning 6 architectures, including 2025 models, confirming that CircuitProbe top predictions match or are within 2 layers of the optimal circuit in all validated cases. A scaling experiment across the Qwen 2.5 family reveals that layer duplication consistently benefits models under 3B parameters but degrades performance in 7B+ models, making this a practical scaling technique for small language models. CircuitProbe requires as few as 10 calibration examples and its predictions are stable across English, Hindi, Chinese, and French.

Authors:Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng
Title: To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
Abstract:
Retrieval‑augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge‑intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non‑parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade‑off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo‑2‑based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1‑150x the number of parameters) and retrieval store size (1‑20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open‑domain QA. We find that retrieval consistently improves performance over parametric‑only baselines across model scales and introduce a three‑dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.

Authors:Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
Title: Learning to Hint for Reinforcement Learning
Abstract:
Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non‑zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no‑hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no‑hint success, and we use this result to define a transfer‑weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no‑hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint‑based baselines, demonstrating the value of adaptive and transfer‑aware hint learning for RL. The code is available at https://github.com/Andree‑9/HiLL.

Authors:Marwan Hassani, Tamara Verbeek, Sjoerd van Straten
Title: Chameleons do not Forget: Prompt-Based Online Continual Learning for Next Activity Prediction
Abstract:
Predictive process monitoring (PPM) focuses on predicting future process trajectories, including next activity predictions. This is crucial in dynamic environments where processes change or face uncertainty. However, current frameworks often assume a static environment, overlooking dynamic characteristics and concept drifts. This results in catastrophic forgetting, where training while focusing merely on new data distribution negatively impacts the performance on previously learned data distributions. Continual learning addresses, among others, the challenges related to mitigating catastrophic forgetting. This paper proposes a novel approach called Continual Next Activity Prediction with Prompts (CNAPwP), which adapts the DualPrompt algorithm for next activity prediction to improve accuracy and adaptability while mitigating catastrophic forgetting. We introduce new datasets with recurring concept drifts, alongside a task‑specific forgetting metric that measures the prediction accuracy gap between initial occurrence and subsequent task occurrences. Extensive testing on three synthetic and two real‑world datasets representing several setups of recurrent drifts shows that CNAPwP achieves SOTA or competitive results compared to five baselines, demonstrating its potential applicability in real‑world scenarios. An open‑source implementation of our method, together with the datasets and results, is available at: https://github.com/SvStraten/CNAPwP.

Authors:Yichen Xie, Yixiao Wang, Shuqi Zhao, Cheng-En Wu, Masayoshi Tomizuka, Jianwen Xie, Hao-Shu Fang
Title: Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning
Abstract:
The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during demonstration collection. Instead of acquiring more trajectories, multiple synchronized camera perspectives are used to generate pseudo‑demonstrations from each expert trajectory, which enriches the training distribution and improves viewpoint invariance in visual representations. We analyze how different action spaces interact with view scaling and show that camera‑space representations further enhance diversity. In addition, we introduce a multiview action aggregation method that allows single‑view policies to benefit from multiple cameras during deployment. Extensive experiments in simulation and real‑world manipulation tasks demonstrate significant gains in data efficiency and generalization compared to single‑view baselines. Our results suggest that scaling camera views provides a practical and scalable solution for imitation learning, which requires minimal additional hardware setup and integrates seamlessly with existing imitation learning algorithms. The website of our project is https://yichen928.github.io/robot_multiview.

Authors:Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz
Title: A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation
Abstract:
Chest X‑rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning‑enabled vision‑language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two‑stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero‑shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general‑domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne‑drafted reports are comparable to or better than resident‑written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI‑assisted CXR interpretation.

Authors:Michael Maynord, Minghui Liu, Cornelia Fermüller, Seongjin Choi, Yuxin Zeng, Shishir Dahal, Daniel M. Harrison
Title: Automated Detection of Multiple Sclerosis Lesions on 7-tesla MRI Using U-net and Transformer-based Segmentation
Abstract:
Ultra‑high field 7‑tesla (7T) MRI improves visualization of multiple sclerosis (MS) white matter lesions (WML) but differs sufficiently in contrast and artifacts from 1.5‑3T imaging ‑ suggesting that widely used automated segmentation tools may not translate directly. We analyzed 7T FLAIR scans and generated reference WML masks from Lesion Segmentation Tool (LST) outputs followed by expert manual revision. As external comparators, we applied LST‑LPA and the more recent LST‑AI ensemble, both originally developed on lower‑field data. We then trained 3D UNETR and SegFormer transformer‑based models on 7T FLAIR at multiple resolutions (0.5x0.5x0.5^3, 1.0x1.0x1.0^3, and 1.5x1.5x2.0^3) and evaluated all methods using voxel‑wise and lesion‑wise metrics from the BraTS 2023 framework. On the held‑out test set at native 0.5x0.5x0.5^3 resolution, 7T‑trained transformers achieved competitive overlap with LST‑AI while recovering additional small lesions that were missed by classical methods, at the cost of some boundary variability and occasional artifact‑related false positives. On a held‑out 7 T test set, our best transformer model (SegFormer) achieved a voxel‑wise Dice of 0.61 and lesion‑wise Dice of 0.20, improving on the classical LST‑LPA tool (Dice 0.39, lesion‑wise Dice 0.02). Performance decreased for models trained on downsampled images, underscoring the value of native 7T resolution for small‑lesion detection. By releasing our 7T‑trained models, we aim to provide a reproducible, ready‑to‑use resource for automated lesion quantification in ultra‑high field MS research (https://github.com/maynord/7T‑MS‑lesion‑segmentation).

Authors:Borislav Mavrin
Title: In harmony with gpt-oss
Abstract:
No one has independently reproduced OpenAI's published scores for gpt‑oss‑20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse‑engineered the model's in‑distribution tools: when prompted without tool definitions, gpt‑oss still calls tools from its training distribution with high statistical confidence ‑‑ a strong prior, not a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).

Authors:Ankit Grover, Lodovico Giaretta, Rémi Bourgerie, Sarunas Girdzijauskas
Title: Is One Token All It Takes? Graph Pooling Tokens for LLM-based GraphQA
Abstract:
The integration of Graph Neural Networks (GNNs) with Large Language Models (LLMs) has emerged as a promising paradigm for Graph Question Answering (GraphQA). However, effective methods for encoding complex structural information into the LLM's latent space remain an open challenge. Current state‑of‑the‑art architectures, such as G‑Retriever, typically rely on standard GNNs and aggressive mean pooling to compress entire graph substructures into a single token, creating a severe information bottleneck. This work mitigates this bottleneck by investigating two orthogonal strategies: (1) increasing the bandwidth of the graph‑to‑LLM interface via multi‑token pooling, and (2) enhancing the semantic quality of the graph encoder via global attention mechanisms. We evaluate a suite of hierarchical pruning and clustering‑based pooling operators including Top‑k, SAGPool, DiffPool, MinCutPool, and Virtual Node Pooling (VNPool) to project graph data into multiple learnable tokens. Empirically, we demonstrate that while pooling introduces significant instability during soft prompt tuning, the application of Low‑Rank Adaptation (LoRA) effectively stabilizes specific hierarchical projections (notably VNPool and pruning methods), though dense clustering operators remain challenging. This stabilization allows compressed representations to rival full‑graph baselines (achieving ~73% Hit@1 on WebQSP). Conceptually, we demonstrate that a Graph Transformer with VNPool implementation functions structurally as a single‑layer Perceiver IO encoder. Finally, we adapt the FandE (Features and Edges) Score to the generative GraphQA domain. Our analysis reveals that the GraphQA benchmark suffers from representational saturation, where target answers are often highly correlated with isolated node features. The implementation is available at https://github.com/Agrover112/G‑Retriever/tree/all_good/

Authors:Yagiz Ihlamur
Title: When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction
Abstract:
Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields ‑‑ jobs, education, exits ‑‑ and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 ‑‑ a +17.7pp improvement over the zero‑shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = ‑0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly ‑‑ it is a lossy re‑encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects the information content of this dataset, not a modeling limitation. In characterizing where the signal runs out and why, this work functions as a benchmark diagnostic ‑‑ one that points directly to what a richer dataset would need to include.

Authors:Huseyin Tuna Erdinc, Ipsita Bhar, Rafael Orozco, Thales Souza, Felix J. Herrmann
Title: SAGE: Subsurface AI-driven Geostatistical Extraction with proxy posterior
Abstract:
Recent advances in generative networks have enabled new approaches to subsurface velocity model synthesis, offering a compelling alternative to traditional methods such as Full Waveform Inversion. However, these approaches predominantly rely on the availability of large‑scale datasets of high‑quality, geologically realistic subsurface velocity models, which are often difficult to obtain in practice. We introduce SAGE, a novel framework for statistically consistent proxy velocity generation from incomplete observations, specifically sparse well logs and migrated seismic images. During training, SAGE learns a proxy posterior over velocity models conditioned on both modalities (wells and seismic); at inference, it produces full‑resolution velocity fields conditioned solely on migrated images, with well information implicitly encoded in the learned distribution. This enables the generation of geologically plausible and statistically accurate velocity realizations. We validate SAGE on both synthetic and field datasets, demonstrating its ability to capture complex subsurface variability under limited observational constraints. Furthermore, samples drawn from the learned proxy distribution can be leveraged to train downstream networks, supporting inversion workflows. Overall, SAGE provides a scalable and data‑efficient pathway toward learning geological proxy posterior for seismic imaging and inversion. Repo link: https://github.com/slimgroup/SAGE.

Authors:Jinghan Yao, Sam Adé Jacobs, Walid Krichene, Masahiro Tanaka, Dhabaleswar K Panda
Title: MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
Abstract:
Long‑context decoding in LLMs is IO‑bound: each token re‑reads an ever‑growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long‑form generation. We introduce MAC‑Attention, a fidelity‑ and access‑preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre‑RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model‑agnostic and composes with IO‑aware kernels, paged‑KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC‑Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention‑phase speedups, up to 2.6x end‑to‑end, while maintaining full‑attention quality. By reusing computation, MAC‑Attention delivers long‑context inference that is both fast and faithful. Code is available here: https://github.com/YJHMITWEB/MAC‑Attention.git

Authors:Annette Taberner-Miller
Title: ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Abstract:
Production LLM serving often relies on multi‑model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade‑off is non‑stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open‑source adaptive router built on cost‑aware contextual bandits that is the first to simultaneously enforce dollar‑denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal‑dual budget pacer enforces a per‑request cost ceiling over an open‑ended stream, replacing offline penalty tuning with closed‑loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot‑swap registry lets operators add or remove models at runtime, with a brief forced‑exploration phase for each newcomer, after which UCB selection discovers its quality‑cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three‑model portfolio. Across seven budget ceilings, mean per‑request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order‑of‑magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold‑started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget‑gated and low‑quality models rejected after bounded exploration. End‑to‑end routing latency is 9.8ms on CPU ‑‑ less than 0.4% of typical inference time ‑‑ with the routing decision itself taking just 22.5us.

Authors:Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, Katia Sycara, Yesh Dattatreya
Title: Generalizable Dense Reward for Long-Horizon Robotic Tasks
Abstract:
Existing robotic foundation policies are trained primarily via large‑scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long‑horizon tasks due to distribution shift and error accumulation. While reinforcement learning (RL) can finetune these models, it cannot work well across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision‑Language Models (VLMs) for task progress recognition, and (2) an intrinsic reward based on policy self‑certainty. VLLR uses LLMs to decompose tasks into verifiable subtasks and then VLMs to estimate progress to initialize the value function for a brief warm‑up phase, avoiding prohibitive inference cost during full training; and self‑certainty provides per‑step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM‑based value initialization primarily improves task completion efficiency, while self‑certainty primarily enhances success rates, particularly on out‑of‑distribution tasks. On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success rate gains over the pretrained policy, up to 5% gains over state‑of‑the‑art RL finetuning methods on in‑distribution tasks, and up to 10% gains on out‑of‑distribution tasks, all without manual reward engineering. Additional visualizations can be found in https://silongyong.github.io/vllr_project_page/

Authors:Luan Borges Teodoro Reis Sena, Francisco Galuppo Azevedo
Title: Real-Time Explanations for Tabular Foundation Models
Abstract:
Interpretability is central for scientific machine learning, as understanding \emphwhy models make predictions enables hypothesis generation and validation. While tabular foundation models show strong performance, existing explanation methods like SHAP are computationally expensive, limiting interactive exploration. We introduce ShapPFN, a foundation model that integrates Shapley value regression directly into its architecture, producing both predictions and explanations in a single forward pass. On standard benchmarks, ShapPFN achieves competitive performance while producing high‑fidelity explanations (R^2=0.96, cosine=0.99) over 1000× faster than KernelSHAP (0.06s vs 610s). Our code is available at https://github.com/kunumi/ShapPFN

Authors:Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu
Title: DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
Abstract:
The development of Vision‑Language‑Action (VLA) models has been significantly accelerated by pre‑trained Vision‑Language Models (VLMs). However, most existing end‑to‑end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision‑language features to low‑level actions. This paradigm underutilizes the VLM's potential in high‑level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high‑level decision making and low‑level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM‑based System‑2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System‑1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two‑stage training paradigm: a decoupled warmup phase where System‑2 learns to predict latent futures while System‑1 learns motor control under ground‑truth future guidance within a unified feature space, followed by seamless end‑to‑end joint optimization. This enables action‑aware gradients to refine the VLM backbone in a controlled manner, preserving pre‑trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state‑of‑the‑art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero‑shot generalization to unseen objects and novel configurations during real‑world deployment on a humanoid robot.

Authors:Giovanni Seraghiti, Kévin Dubrulle, Arnaud Vandaele, Nicolas Gillis
Title: Nonnegative Matrix Factorization in the Component-Wise L1 Norm for Sparse Data
Abstract:
Nonnegative matrix factorization (NMF) approximates a nonnegative matrix, X, by the product of two nonnegative factors, WH, where W has r columns and H has r rows. In this paper, we consider NMF using the component‑wise L1 norm as the error measure (L1‑NMF), which is suited for data corrupted by heavy‑tailed noise, such as Laplace noise or salt and pepper noise, or in the presence of outliers. Our first contribution is an NP‑hardness proof for L1‑NMF, even when r=1, in contrast to the standard NMF that uses least squares. Our second contribution is to show that L1‑NMF strongly enforces sparsity in the factors for sparse input matrices, thereby favoring interpretability. However, if the data is affected by false zeros, too sparse solutions might degrade the model. Our third contribution is a new, more general, L1‑NMF model for sparse data, dubbed weighted L1‑NMF (wL1‑NMF), where the sparsity of the factorization is controlled by adding a penalization parameter to the entries of WH associated with zeros in the data. The fourth contribution is a new coordinate descent (CD) approach for wL1‑NMF, denoted as sparse CD (sCD), where each subproblem is solved by a weighted median algorithm. To the best of our knowledge, sCD is the first algorithm for L1‑NMF whose complexity scales with the number of nonzero entries in the data, making it efficient in handling large‑scale, sparse data. We perform extensive numerical experiments on synthetic and real‑world data to show the effectiveness of our new proposed model (wL1‑NMF) and algorithm (sCD).

Authors:Lixin Xiu, Xufang Luo, Hideki Nakayama
Title: A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models
Abstract:
Large vision‑language models (LVLMs) achieve impressive performance, yet their internal decision‑making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs ‑‑ decomposing a model's decision‑relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model‑agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions ‑‑ breadth (cross‑model & cross‑task), depth (layer‑wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy‑driven vs. knowledge‑driven) and (ii) two stable, contrasting family‑level strategies (fusion‑centric vs. language‑centric). We also uncover a consistent three‑phase pattern in layer‑wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy‑only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid‑lvlm‑analysis .

Authors:Cheng Yang, Yu Hao, Qi Zhang, Chuan Shi
Title: Disentangled Graph Prompting for Out-Of-Distribution Detection
Abstract:
When testing data and training data come from different distributions, deep neural networks (DNNs) will face significant safety risks in practical applications. Therefore, out‑of‑distribution (OOD) detection techniques, which can identify OOD samples at test time and alert the system, are urgently needed. Existing graph OOD detection methods usually characterize fine‑grained in‑distribution (ID) patterns from multiple perspectives, and train end‑to‑end graph neural networks (GNNs) for prediction. However, due to the unavailability of OOD data during training, the absence of explicit supervision signals could lead to sub‑optimal performance of end‑to‑end encoders. To address this issue, we follow the pre‑training+prompting paradigm to utilize pre‑trained GNN encoders, and propose Disentangled Graph Prompting (DGP), to capture fine‑grained ID patterns with the help of ID graph labels. Specifically, we design two prompt generators that respectively generate class‑specific and class‑agnostic prompt graphs by modifying the edge weights of an input graph. We also design several effective losses to train the prompt generators and prevent trivial solutions. We conduct extensive experiments on ten datasets to demonstrate the superiority of our proposed DGP, which achieves a relative AUC improvement of 3.63% over the best graph OOD detection baseline. Ablation studies and hyper‑parameter experiments further show the effectiveness of DGP. Code is available at https://github.com/BUPT‑GAMMA/DGP.

Authors:Linda Zeng, Steven Y. Feng, Michael C. Frank
Title: Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models
Abstract:
Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M‑word mono‑ and bilingual datasets using synthetic data and machine translation. We train GPT‑2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in‑principle challenges for agnostic statistical learners.

Authors:Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank
Title: Baby Scale: Investigating Models Trained on Individual Children's Language Input
Abstract:
Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human‑scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6‑36 months), we investigate (1) scaling performance at child‑scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high‑quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child‑directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small‑scale language models while also shedding light on human language acquisition.

Authors:Zhongheng Jiang, Yuechao Zhao, Donglin Xie, Chenxi Sun, Rongchen Lu, Silu Luo, Zisheng Liang, Shenda Hong
Title: mtslearn: Machine Learning in Python for Medical Time Series
Abstract:
Medical time‑series data captures the dynamic progression of patient conditions, playing a vital role in modern clinical decision support systems. However, real‑world clinical data is highly heterogeneous and inconsistently formatted. Furthermore, existing machine learning tools often have steep learning curves and fragmented workflows. Consequently, a significant gap remains between cutting‑edge AI technologies and clinical application. To address this, we introduce mtslearn, an end‑to‑end integrated toolkit specifically designed for medical time‑series data. First, the framework provides a unified data interface that automates the parsing and alignment of wide, long, and flat data formats. This design significantly reduces data cleaning overhead. Building on this, mtslearn provides a complete pipeline from data reading and feature engineering to model training and result visualization. Furthermore, it offers flexible interfaces for custom algorithms. Through a modular design, mtslearn simplifies complex data engineering tasks into a few lines of code. This significantly lowers the barrier to entry for clinicians with limited programming experience, empowering them to focus more on exploring medical hypotheses and accelerating the translation of advanced algorithms into real‑world clinical practice. mtslearn is publicly available at https://github.com/PKUDigitalHealth/mtslearn.

Authors:Yubo Cui, Xianchao Guan, Zijun Xiong, Zheng Zhang
Title: AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
Abstract:
Pre‑trained vision‑language models (VLMs) exhibit strong zero‑shot generalization but remain vulnerable to adversarial perturbations. Existing classification‑guided adversarial fine‑tuning methods often disrupt pre‑trained cross‑modal alignment, weakening visual‑textual correspondence and degrading zero‑shot performance. In this paper, we propose an Alignment‑Guided Fine‑Tuning (AGFT) framework that enhances zero‑shot adversarial robustness while preserving the cross‑modal semantic structure. Unlike label‑based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text‑guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero‑shot adversarial robustness. To address structural discrepancies introduced by fine‑tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature‑scaled version of the pre‑trained model predictions. Extensive experiments across multiple zero‑shot benchmarks demonstrate that AGFT outperforms state‑of‑the‑art methods while significantly improving zero‑shot adversarial robustness.

Authors:Tal Ishon, Yoav Goldberg, Uri Shaham
Title: PRISM: PRIor from corpus Statistics for topic Modeling
Abstract:
Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre‑trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus‑intrinsic method that derives a Dirichlet parameter from word co‑occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single cell RNA‑seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus‑driven initialization for topic modeling in resource‑constrained settings. Code is available at: https://github.com/shaham‑lab/PRISM.

Authors:Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang
Title: Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
Abstract:
Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error‑prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two‑pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain‑of‑Structured‑Thought (CoST). We introduce a CoST template, a schema‑aware instruction that guides a strong LLM to produce both a step‑wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine‑tuning. The compact models are trained on LLM‑generated CoST data in two stages: Supervised Fine‑Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure‑first behavior into SLMs, this approach achieves LLM‑comparable quality on multi‑domain long‑document QA using 3B/7B SLMs, while delivering 2‑4x lower latency than GPT‑4o and DeepSeek‑R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.

Authors:Disen Liao, Felix Dangel, Yaoliang Yu
Title: Efficient Bilevel Optimization with KFAC-Based Hypergradients
Abstract:
Bilevel optimization (BO) is widely applicable to many machine learning problems. Scaling BO, however, requires repeatedly computing hypergradients, which involves solving inverse Hessian‑vector products (IHVPs). In practice, these operations are often approximated using crude surrogates such as one‑step gradient unrolling or identity/short Neumann expansions, which discard curvature information. We build on implicit function theorem‑based algorithms and propose to incorporate Kronecker‑factored approximate curvature (KFAC), yielding curvature‑aware hypergradients with a better performance efficiency trade‑off than Conjugate Gradient (CG) or Neumann methods and consistently outperforming unrolling. We evaluate this approach across diverse tasks, including meta‑learning and AI safety problems. On models up to BERT, we show that curvature information is valuable at scale, and KFAC can provide it with only modest memory and runtime overhead. Our implementation is available at https://github.com/liaodisen/NeuralBo.

Authors:Jaber Jaber, Osama Jaber
Title: HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling
Abstract:
World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture that operates on three interconnected principles: object‑centric decomposition via slot attention with spatial broadcast decoding, hierarchical temporal dynamics through a three‑level engine combining selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals, and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two‑stage training protocol where spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M‑parameter model on the PushT robotic manipulation benchmark from the Open X‑Embodiment dataset, achieving 0.008 MSE next‑state prediction loss with emerging spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers 38x speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: https://github.com/rightnow‑ai/hclsm

Authors:Caio Vicentino
Title: PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Abstract:
We present PolarQuant, a post‑training weight quantization method for large language models (LLMs) that exploits the distributional structure of neural network weights to achieve near‑lossless compression. PolarQuant operates in three stages: (1) block‑wise normalization to the unit hypersphere, (2) Walsh‑Hadamard rotation to transform coordinates into approximately Gaussian random variables, and (3) quantization with centroids matched to the Gaussian distribution. Our ablation reveals that Hadamard rotation alone accounts for 98% of the quality improvement, reducing Qwen3.5‑9B perplexity from 6.90 (absmax Q5) to 6.40 (Delta = +0.03 from FP16), making it practically lossless without any calibration data. Furthermore, PolarQuant functions as an effective preprocessing step for downstream INT4 quantizers: PolarQuant Q5 dequantized and re‑quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4, while maintaining 43.1 tok/s throughput at 6.5 GB VRAM. Code and models are publicly available.

Authors:Tushar Dhananjay Pathak
Title: ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning
Abstract:
This paper presents ARCS (Autoregressive Circuit Synthesis), a system for amortized analog circuit generation that produces complete, SPICE‑simulatable designs (topology and component values) in milliseconds rather than the minutes required by search‑based methods. A hybrid pipeline combining two learned generators (a graph VAE and a flow‑matching model) with SPICE‑based ranking achieves 99.9% simulation validity (reward 6.43/8.0) across 32 topologies using only 8 SPICE evaluations, 40x fewer than genetic algorithms. For single‑model inference, a topology‑aware Graph Transformer with Best‑of‑3 candidate selection reaches 85% simulation validity in 97ms, over 600x faster than random search. The key technical contribution adapts Group Relative Policy Optimization (GRPO) to multi‑topology circuit reinforcement learning, resolving a critical failure mode of REINFORCE (cross‑topology reward distribution mismatch) through per‑topology advantage normalization. This improves simulation validity by +9.6 percentage points over REINFORCE in only 500 RL steps (10x fewer). Grammar‑constrained decoding additionally guarantees 100% structural validity by construction via topology‑aware token masking.

Authors:Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen
Title: An Empirical Recipe for Universal Phone Recognition
Abstract:
Phone recognition (PR) is a key enabler of multilingual and low‑resource speech processing tasks, yet robust performance remains elusive. Highly performant English‑focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS ‑‑ trained on large‑scale multilingual data and achieving state‑of‑the‑art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.

Authors:Subhadip Mitra
Title: Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
Abstract:
Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark‑LLM‑Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data‑parallel problem, partitioningexamplesacrossexecutorsandaggregatingresultswithproperstatistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t‑tests, McNemar's test, or Wilcoxon signed‑rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content‑addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re‑running inference. We describe the system architecture, the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.

Authors:Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen
Title: Rethinking Language Model Scaling under Transferable Hypersphere Optimization
Abstract:
Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first‑order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed‑norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture‑of‑Experts (MoE) granularity under the Frobenius‑sphere constraint with the Muon optimizer. We prove that weight decay is a first‑order no‑op on the Frobenius sphere, show that Depth‑μP remains necessary, and find that the optimal learning rate follows the same data‑scaling power law with the "magic exponent" 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding 1.58× compute efficiency over a strong Muon baseline at 6×10^21 FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including Z‑values, output RMS, and activation outliers, remain bounded and non‑increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load‑balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.

Authors:Yihan Gao, Chenxi Huang, Wen Shi, Ke Sun, Ziqi Xu, Xikun Zhang, Mingliang Hou, Renqiang Luo
Title: FairGC: Fairness-aware Graph Condensation
Abstract:
Graph condensation (GC) has become a vital strategy for scaling Graph Neural Networks by compressing massive datasets into small, synthetic node sets. While current GC methods effectively maintain predictive accuracy, they are primarily designed for utility and often ignore fairness constraints. Because these techniques are bias‑blind, they frequently capture and even amplify demographic disparities found in the original data. This leads to synthetic proxies that are unsuitable for sensitive applications like credit scoring or social recommendations. To solve this problem, we introduce FairGC, a unified framework that embeds fairness directly into the graph distillation process. Our approach consists of three key components. First, a Distribution‑Preserving Condensation module synchronizes the joint distributions of labels and sensitive attributes to stop bias from spreading. Second, a Spectral Encoding module uses Laplacian eigen‑decomposition to preserve essential global structural patterns. Finally, a Fairness‑Enhanced Neural Architecture employs multi‑domain fusion and a label‑smoothing curriculum to produce equitable predictions. Rigorous evaluations on four real‑world datasets, show that FairGC provides a superior balance between accuracy and fairness. Our results confirm that FairGC significantly reduces disparity in Statistical Parity and Equal Opportunity compared to existing state‑of‑the‑art condensation models. The codes are available at https://github.com/LuoRenqiang/FairGC.

Authors:Yangmei Chen, Zhongyuan Zhang, Xikun Zhang, Xinyu Hao, Mingliang Hou, Renqiang Luo, Ziqi Xu
Title: Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification
Abstract:
Thyroid nodule classification using ultrasound imaging is essential for early diagnosis and clinical decision‑making; however, despite promising performance on in‑distribution data, existing deep learning methods often exhibit limited robustness and generalisation when deployed across different ultrasound devices or clinical environments. This limitation is mainly attributed to the pronounced heterogeneity of thyroid ultrasound images, which can lead models to capture spurious correlations rather than reliable diagnostic cues. To address this challenge, we propose PEMV‑thyroid, a Prototype‑Enhanced Multi‑View learning framework that accounts for data heterogeneity by learning complementary representations from multiple feature perspectives and refining decision boundaries through a prototype‑based correction mechanism with mixed prototype information. By integrating multi‑view representations with prototype‑level guidance, the proposed approach enables more stable representation learning under heterogeneous imaging conditions. Extensive experiments on multiple thyroid ultrasound datasets demonstrate that PEMV‑thyroid consistently outperforms state‑of‑the‑art methods, particularly in cross‑device and cross‑domain evaluation scenarios, leading to improved diagnostic accuracy and generalisation performance in real‑world clinical settings. The source code is available at https://github.com/chenyangmeii/Prototype‑Enhanced‑Multi‑View‑Learning.

Authors:Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung
Title: LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
Abstract:
Vision‑Language‑Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre‑trained vision‑language backbones. However, in downstream robotic settings, they are typically fine‑tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO‑Para, a controlled benchmark that independently varies action expressions and object references for fine‑grained analysis of linguistic generalization. Across seven VLA configurations (0.6B‑7.5B), we observe consistent performance degradation of 22‑52 pp under paraphrasing. This degradation is primarily driven by object‑level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface‑level matching rather than semantic grounding. Moreover, 80‑96% of failures arise from planning‑level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau‑hai‑lab/LIBERO‑Para

Authors:Qing Qing, Huafei Huang, Mingliang Hou, Renqiang Luo, Mohsen Guizani
Title: NeiGAD: Augmenting Graph Anomaly Detection via Spectral Neighbor Information
Abstract:
Graph anomaly detection (GAD) aims to identify irregular nodes or structures in attributed graphs. Neighbor information, which reflects both structural connectivity and attribute consistency with surrounding nodes, is essential for distinguishing anomalies from normal patterns. Although recent graph neural network (GNN)‑based methods incorporate such information through message passing, they often fail to explicitly model its effect or interaction with attributes, limiting detection performance. This work introduces NeiGAD, a novel plug‑and‑play module that captures neighbor information through spectral graph analysis. Theoretical insights demonstrate that eigenvectors of the adjacency matrix encode local neighbor interactions and progressively amplify anomaly signals. Based on this, NeiGAD selects a compact set of eigenvectors to construct efficient and discriminative representations. Experiments on eight real‑world datasets show that NeiGAD consistently improves detection accuracy and outperforms state‑of‑the‑art GAD methods. These results demonstrate the importance of explicit neighbor modeling and the effectiveness of spectral analysis in anomaly detection. Code is available at: https://github.com/huafeihuang/NeiGAD.

Authors:Gnankan Landry Regis N'guessan
Title: FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks
Abstract:
Kolmogorov‑Arnold Networks (KAN) employ B‑spline bases on a fixed grid, providing no intrinsic multi‑scale decomposition for non‑smooth function approximation. We introduce Fractal Interpolation KAN (FI‑KAN), which incorporates learnable fractal interpolation function (FIF) bases from iterated function system (IFS) theory into KAN. Two variants are presented: Pure FI‑KAN (Barnsley, 1986) replaces B‑splines entirely with FIF bases; Hybrid FI‑KAN (Navascues, 2005) retains the B‑spline path and adds a learnable fractal correction. The IFS contraction parameters give each edge a differentiable fractal dimension that adapts to target regularity during training. On a Holder regularity benchmark (α\in [0.2, 2.0]), Hybrid FI‑KAN outperforms KAN at every regularity level (1.3x to 33x). On fractal targets, FI‑KAN achieves up to 6.3x MSE reduction over KAN, maintaining 4.7x advantage at 5 dB SNR. On non‑smooth PDE solutions (scikit‑fem), Hybrid FI‑KAN achieves up to 79x improvement on rough‑coefficient diffusion and 3.5x on L‑shaped domain corner singularities. Pure FI‑KAN's complementary behavior, dominating on rough targets while underperforming on smooth ones, provides controlled evidence that basis geometry must match target regularity. A fractal dimension regularizer provides interpretable complexity control whose learned values recover the true fractal dimension of each target. These results establish regularity‑matched basis design as a principled strategy for neural function approximation.

Authors:He Yang, Dongyi Lv, Song Ma, Wei Xi, Zhi Wang, Hanlin Gu, Yajie Wang
Title: InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
Abstract:
Dataset Condensation (DC) is a data‑efficient learning paradigm that synthesizes small yet informative datasets, enabling models to match the performance of full‑data training. However, recent work exposes a critical vulnerability of DC to backdoor attacks, where malicious patterns (e.g., triggers) are implanted into the condensation dataset, inducing targeted misclassification on specific inputs. Existing attacks always prioritize attack effectiveness and model utility, overlooking the crucial dimension of stealthiness. To bridge this gap, we propose InkDrop, which enhances the imperceptibility of malicious manipulation without degrading attack effectiveness and model utility. InkDrop leverages the inherent uncertainty near model decision boundaries, where minor input perturbations can induce semantic shifts, to construct a stealthy and effective backdoor attack. Specifically, InkDrop first selects candidate samples near the target decision boundary that exhibit latent semantic affinity to the target class. It then learns instance‑dependent perturbations constrained by perceptual and spatial consistency, embedding targeted malicious behavior into the condensed dataset. Extensive experiments across diverse datasets validate the overall effectiveness of InkDrop, demonstrating its ability to integrate adversarial intent into condensed datasets while preserving model utility and minimizing detectability. Our code is available at https://github.com/lvdongyi/InkDrop.

Authors:Truong-Son Hy
Title: Q-BIOLAT: Binary Latent Protein Fitness Landscapes for QUBO-Based Optimization
Abstract:
Protein fitness optimization is inherently a discrete combinatorial problem, yet most learning‑based approaches rely on continuous representations and are primarily evaluated through predictive accuracy. We introduce Q‑BIOLAT, a framework for modeling and optimizing protein fitness landscapes in compact binary latent spaces. Starting from pretrained protein language model embeddings, we construct binary latent representations and learn a quadratic unconstrained binary optimization (QUBO) surrogate that captures unary and pairwise interactions. Beyond its formulation, Q‑BIOLAT provides a representation‑centric perspective on protein fitness modeling. We show that representations with similar predictive performance can induce fundamentally different optimization landscapes. In particular, learned autoencoder‑based representations collapse after binarization, producing degenerate latent spaces that fail to support combinatorial search, whereas simple structured representations such as PCA yield high‑entropy, decodable, and optimization‑friendly latent spaces. Across multiple datasets and data regimes, we demonstrate that classical combinatorial optimization methods, including simulated annealing, genetic algorithms, and greedy hill climbing, are highly effective in structured binary latent spaces. By expressing the objective in QUBO form, our approach connects modern machine learning with discrete and quantum‑inspired optimization. Our implementation and dataset are publicly available at: https://github.com/HySonLab/Q‑BIOLAT‑Extended

Authors:Duraimurugan Rajamanickam
Title: Decomposing Discrimination: Causal Mediation Analysis for AI-Driven Credit Decisions
Abstract:
Statistical fairness metrics in AI‑driven credit decisions conflate two causally distinct mechanisms: discrimination operating directly from a protected attribute to a credit outcome, and structural inequality propagating through legitimate financial features. We formalise this distinction using Pearl's framework of natural direct and indirect effects applied to the credit decision setting. Our primary theoretical contribution is an identification strategy for natural direct and indirect effects under treatment‑induced confounding ‑‑ the prevalent setting in which protected attributes causally affect both financial mediators and the final decision, violating standard sequential ignorability. We show that interventional direct and indirect effects (IDE/IIE) are identified under the weaker Modified Sequential Ignorability assumption, and prove that IDE/IIE provide conservative bounds on the unidentified natural effects under monotone indirect treatment response. We propose a doubly‑robust augmented inverse probability weighted (AIPW) estimator for IDE/IIE with semiparametric efficiency properties, implemented via cross‑fitting. An E‑value sensitivity analysis addresses residual confounding on the direct pathway. Empirical evaluation on 89,465 real HMDA conventional purchase mortgage applications from New York State (2022) demonstrates that approximately 77% of the observed 7.9 percentage‑point racial denial disparity operates through financial mediators shaped by structural inequality, while the remaining 23% constitutes a conservative lower bound on direct discrimination. The open‑source CausalFair Python package implements the full pipeline for deployment at resource‑constrained financial institutions.

Authors:Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong
Title: On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
Abstract:
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE‑based continual learners still suffer from forgetting due to routing‑drift: old‑task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new‑task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA‑DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift‑aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token‑level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing‑drift, while complementary routing score regularizations enforce expert‑group separation and promote new‑expert specialization. Extensive experiments demonstrate that our LLaVA‑DyMoE effectively mitigates routing‑drift‑induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.

Authors:Suraj Ranganath, Vaishak Menon, Anish Patnaik
Title: KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study
Abstract:
Self‑forcing video generation extends a short‑horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key‑value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV‑cache compression for self‑forcing video generation on a Wan2.1‑based Self‑Forcing stack. Our study covers 33 quantization and cache‑policy variants, 610 prompt‑level observations, and 63 benchmark‑level summaries across two evaluation settings: MovieGen for single‑shot 10‑second generation and StoryEval for longer narrative‑style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16‑referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift. Three findings are robust. First, the strongest practical operating region is a FlowCache‑inspired soft‑prune INT4 adaptation, which reaches 5.42‑5.49x compression while reducing peak VRAM from 19.28 GB to about 11.7 GB with only modest runtime overhead. Second, the highest‑fidelity compressed methods, especially PRQ_INT4 and QUAROT_KV_INT4, are not the best deployment choices because they preserve quality at severe runtime or memory cost. Third, nominal compression alone is not sufficient: several methods shrink KV storage but still exceed BF16 peak VRAM because the current integration reconstructs or retains large BF16 buffers during attention and refresh stages. The result is a benchmark harness, analysis workflow, and empirical map of which KV‑cache ideas are practical today and which are promising research directions for better memory integration. Code, data products, and the presentation dashboard are available at https://github.com/suraj‑ranganath/kv‑quant‑longhorizon/.

Authors:Mohsen Dehghankar, Abolfazl Asudeh
Title: RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication
Abstract:
Matrix‑vector multiplication is a fundamental building block in neural networks, vector databases, and large language models, particularly during inference. As a result, efficient matrix‑vector multiplication engines directly translate into more efficient inference. Recent work has explored low‑bit quantization of model weights, where matrices are represented using binary (1‑bit) or ternary (1.58‑bit) values while activation is kept in higher precision. These representations enable efficient hardware‑level computation. In parallel, algorithms such as Redundant Segment Reduction (RSR) provide theoretical guarantees for accelerating low‑bit matrix‑vector multiplication. However, existing implementations operate at the application level and cannot be efficiently integrated into hardware kernels, limiting practical performance. To bridge this gap, we present RSR‑core, a high‑performance engine that implements the RSR algorithm as optimized low‑level kernels for both CPU and CUDA environments. RSR‑core supports efficient matrix‑vector multiplication for binary and ternary weight matrices and general vectors while enabling practical deployment of RSR algorithm in real inference pipelines. RSR‑core is provided as a production‑ready engine with HuggingFace integration for preprocessing low‑bit models and running accelerated inference. Experimental results demonstrate significant performance improvements over baseline HuggingFace PyTorch multiplication, achieving up to 62x speedup on CPU and up to 1.9x speedup for token generation on CUDA for popular ternary LLMs. The source code is publicly available at https://github.com/UIC‑InDeXLab/RSR‑core.

Authors:Chenxiao Gao, Edward Chen, Tianyi Chen, Bo Dai
Title: FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies
Abstract:
Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) upon these policies remains a challenge due to the lack of explicit log‑probabilities for vanilla policy gradient estimators. While numerous attempts have been proposed to address this, the field lacks a unified perspective to reconcile these seemingly disparate methods, thus hampering ongoing development. In this paper, we bridge this gap by introducing a comprehensive taxonomy for RL algorithms with diffusion/flow policies. To support reproducibility and agile prototyping, we introduce a modular, JAX‑based open‑source codebase that leverages JIT‑compilation for high‑throughput training. Finally, we provide systematic and standardized benchmarks across Gym‑Locomotion, DeepMind Control Suite, and IsaacLab, offering a rigorous side‑by‑side comparison of diffusion‑based methods and guidance for practitioners to choose proper algorithms based on the application. Our work establishes a clear foundation for understanding and algorithm design, a high‑efficiency toolkit for future research in the field, and an algorithmic guideline for practitioners in generative models and robotics. Our code is available at https://github.com/typoverflow/flow‑rl.

Authors:Naveen Mysore
Title: Diagnosing Non-Markovian Observations in Reinforcement Learning via Prediction-Based Violation Scoring
Abstract:
Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real‑world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without diagnostic tools for such violations. This paper introduces a prediction‑based scoring method that quantifies non‑Markovian structure in observation trajectories. A random forest first removes nonlinear Markov‑compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post‑hoc detection, 7 of 16 environment‑algorithm pairs, primarily high‑dimensional locomotion tasks, show significant positive monotonicity between noise intensity and the violation score (Spearman rho up to 0.78, confirmed under repeated‑measures analysis); under training‑time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low‑dimensional environments where the random forest absorbs the noise signal, causing the score to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that the proposed score correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non‑Markovian observations. Source code to reproduce all results is provided at https://github.com/NAVEENMN/Markovianes.

Authors:PengYu Chen, Shang Wan, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das
Title: VAN-AD: Visual Masked Autoencoder with Normalizing Flow For Time Series Anomaly Detection
Abstract:
Time series anomaly detection (TSAD) is essential for maintaining the reliability and security of IoT‑enabled service systems. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. To address this limitation, foundation models have emerged as a promising direction. However, existing approaches either repurpose large language models (LLMs) or construct largescale time series datasets to develop general anomaly detection foundation models, and still face challenges caused by severe cross‑modal gaps or in‑domain heterogeneity. In this paper, we investigate the applicability of large‑scale vision models to TSAD. Specifically, we adapt a visual Masked Autoencoder (MAE) pretrained on ImageNet to the TSAD task. However, directly transferring MAE to TSAD introduces two key challenges: overgeneralization and limited local perception. To address these challenges, we propose VAN‑AD, a novel MAE‑based framework for TSAD. To alleviate the over‑generalization issue, we design an Adaptive Distribution Mapping Module (ADMM), which maps the reconstruction results before and after MAE into a unified statistical space to amplify discrepancies caused by abnormal patterns. To overcome the limitation of local perception, we further develop a Normalizing Flow Module (NFM), which combines MAE with normalizing flow to estimate the probability density of the current window under the global distribution. Extensive experiments on nine real‑world datasets demonstrate that VAN‑AD consistently outperforms existing state‑of‑the‑art methods across multiple evaluation metrics.We make our code and datasets available at https://github.com/PenyChen/VAN‑AD.

Authors:Alberto G. Rodriguez Salgado
Title: From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
Abstract:
How do multimodal models solve visual spatial tasks ‑‑ through genuine planning, or through brute‑force search in token space? We introduce \textscMazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT‑5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710‑‑22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2‑‑12%; on 20×20 ultra‑hard mazes, they hit token limits and fail. Qualitative traces reveal a common two‑stage strategy: image‑to‑grid translation followed by token‑level search, effectively BFS in prose. A text‑grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. \textscMazeBench therefore shows that high accuracy on visual planning tasks does not imply human‑like spatial understanding.

Authors:Ling Zhang, Boxiang Yun, Ting Jin, Qingli Li, Xinxing Li, Yan Wang
Title: Dictionary-based Pathology Mining with Hard-instance-assisted Classifier Debiasing for Genetic Biomarker Prediction from WSIs
Abstract:
Prediction of genetic biomarkers, e.g., microsatellite instability in colorectal cancer is crucial for clinical decision making. But, two primary challenges hamper accurate prediction: (1) It is difficult to construct a pathology‑aware representation involving the complex interconnections among pathological components. (2) WSIs contain a large proportion of areas unrelated to genetic biomarkers, which make the model easily overfit simple but irrelative instances. We hereby propose a Dictionary‑based hierarchical pathology mining with hard‑instance‑assisted classifier Debiasing framework to address these challenges, dubbed as D2Bio. Our first module, dictionary‑based hierarchical pathology mining, is able to mine diverse and very fine‑grained pathological contextual interaction without the limit to the distances between patches. The second module, hard‑instance‑assisted classfier debiasing, learns a debiased classifier via focusing on hard but task‑related features, without any additional annotations. Experimental results on five cohorts show the superiority of our method, with over 4% improvement in AUROC compared with the second best on the TCGA‑CRC‑MSI cohort. Our analysis further shows the clinical interpretability of D2Bio in genetic biomarker diagnosis and potential clinical utility in survival analysis. Code will be available at https://github.com/DeepMed‑Lab‑ECNU/D2Bio.

Authors:Swastik R
Title: Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages
Abstract:
Vision‑language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross‑lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross‑verification on 50 samples per language (inter‑translator agreement 0.79‑0.84). Eight VLMs, from 7B open‑source models to GPT‑4o, are evaluated across all seven languages, yielding 68,600 inference records that include text‑only and chain‑of‑thought ablations. I find accuracy drops of 9.8‑25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo‑Aryan. Chain‑of‑thought prompting degrades Bengali (‑14.4 pp) and Kannada (‑11.4 pp) rather than helping, exposing English‑centric reasoning chains. Aya‑Vision‑8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.

Authors:Guangli Li, Canbiao Wu, Na Tian, Li Zhang, Zhen Liang
Title: Boundary-aware Prototype-driven Adversarial Alignment for Cross-Corpus EEG Emotion Recognition
Abstract:
Electroencephalography (EEG)‑based emotion recognition suffers from severe performance degradation when models are transferred across heterogeneous datasets due to physiological variability, experimental paradigm differences, and device inconsistencies. Existing domain adversarial methods primarily enforce global marginal alignment and often overlook class‑conditional mismatch and decision boundary distortion, limiting cross‑corpus generalization. In this work, we propose a unified Prototype‑driven Adversarial Alignment (PAA) framework for cross‑corpus EEG emotion recognition. The framework is progressively instantiated in three configurations: PAA‑L, which performs prototype‑guided local class‑conditional alignment; PAA‑C, which further incorporates contrastive semantic regularization to enhance intra‑class compactness and inter‑class separability; and PAA‑M, the full boundary‑aware configuration that integrates dual relation‑aware classifiers within a three‑stage adversarial optimization scheme to explicitly refine controversial samples near decision boundaries. By combining prototype‑guided subdomain alignment, contrastive discriminative enhancement, and boundary‑aware aggregation within a coherent adversarial architecture, the proposed framework reformulates emotion recognition as a relation‑driven representation learning problem, reducing sensitivity to label noise and improving cross‑domain stability. Extensive experiments on SEED, SEED‑IV, and SEED‑V demonstrate state‑of‑the‑art performance under four cross‑corpus evaluation protocols, with average improvements of 6.72%, 5.59%, 6.69%, and 4.83%, respectively. Furthermore, the proposed framework generalizes effectively to clinical depression identification scenarios, validating its robustness in real‑world heterogeneous settings. The source code is available at https://github.com/WuCB‑BCI/PAA

Authors:Dávid Pukanec, Tibor Kubík, Michal Španěl
Title: From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion
Abstract:
We present ToothCraft, a diffusion‑based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of an automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real‑world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: https://github.com/ikarus1211/VISAPP_ToothCraft

Authors:Doğaç Eldenk, Stephen Xia
Title: UNIFERENCE: A Discrete Event Simulation Framework for Developing Distributed AI Models
Abstract:
Developing and evaluating distributed inference algorithms remains difficult due to the lack of standardized tools for modeling heterogeneous devices and networks. Existing studies often rely on ad‑hoc testbeds or proprietary infrastructure, making results hard to reproduce and limiting exploration of hypothetical hardware or network configurations. We present UNIFERENCE, a discrete‑event simulation (DES) framework designed for developing, benchmarking, and deploying distributed AI models within a unified environment. UNIFERENCE models device and network behavior through lightweight logical processes that synchronize only on communication primitives, eliminating rollbacks while preserving the causal order. It integrates seamlessly with PyTorch Distributed, enabling the same codebase to transition from simulation to real deployment. Our evaluation demonstrates that UNIFERENCE profiles runtime with up to 98.6% accuracy compared to real physical deployments across diverse backends and hardware setups. By bridging simulation and deployment, UNIFERENCE provides an accessible, reproducible platform for studying distributed inference algorithms and exploring future system designs, from high‑performance clusters to edge‑scale devices. The framework is open‑sourced at https://github.com/Dogacel/Uniference.

Authors:Siddhartha Laghuvarapu, Rohan Deb, Jimeng Sun
Title: KMM-CP: Practical Conformal Prediction under Covariate Shift via Selective Kernel Mean Matching
Abstract:
Uncertainty quantification is essential for deploying machine learning models in high‑stakes domains such as scientific discovery and healthcare. Conformal Prediction (CP) provides finite‑sample coverage guarantees under exchangeability, an assumption often violated in practice due to distribution shift. Under covariate shift, restoring validity requires importance weighting, yet accurate density‑ratio estimation becomes unstable when training and test distributions exhibit limited support overlap. We propose KMM‑CP, a conformal prediction framework based on Kernel Mean Matching (KMM) for covariate‑shift correction. We show that KMM directly controls the bias‑variance components governing conformal coverage error by minimizing RKHS moment discrepancy under explicit weight constraints, and establish asymptotic coverage guarantees under mild conditions. We then introduce a selective extension that identifies regions of reliable support overlap and restricts conformal correction to this subset, further improving stability in low‑overlap regimes. Experiments on molecular property prediction benchmarks with realistic distribution shifts show that KMM‑CP reduces coverage gap by over 50% compared to existing approaches. The code is available at https://github.com/siddharthal/KMM‑CP.

Authors:Cai Selvas-Sala, Lei Kang, Lluis Gomez
Title: SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
Abstract:
As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively‑trained encoders remains underexplored, and existing evaluations fail to diagnose fine‑grained, association‑level forgetting. We introduce SALMUBench (Sensitive Association‑Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona‑attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. To isolate unlearning effects, both are trained from scratch on the same 400M‑pair retain base, with the Compromised model additionally trained on the sensitive set. We propose a novel evaluation protocol with structured holdout sets (holdout identity, holdout association) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility‑efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over‑generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.

Authors:Shuyi Gao, Stavros Orfanoudakis, Shengren Hou, Peter Palensky, Pedro P. Vergara
Title: Topology-Aware Graph Reinforcement Learning for Energy Storage Systems Optimal Dispatch in Distribution Networks
Abstract:
Optimal dispatch of energy storage systems (ESSs) in distribution networks involves jointly improving operating economy and voltage security under time‑varying conditions and possible topology changes. To support fast online decision making, we develop a topology‑aware Reinforcement Learning architecture based on Twin Delayed Deep Deterministic Policy Gradient (TD3), which integrates graph neural networks (GNNs) as graph feature encoders for ESS dispatch. We conduct a systematic investigation of three GNN variants: graph convolutional networks (GCNs), topology adaptive graph convolutional networks (TAGConv), and graph attention networks (GATs) on the 34‑bus and 69‑bus systems, and evaluate robustness under multiple topology reconfiguration cases as well as cross‑system transfer between networks with different system sizes. Results show that GNN‑based controllers consistently reduce the number and magnitude of voltage violations, with clearer benefits on the 69‑bus system and under reconfiguration; on the 69‑bus system, TD3‑GCN and TD3‑TAGConv also achieve lower saved cost relative to the NLP benchmark than the NN baseline. We also highlight that transfer gains are case‑dependent, and zero‑shot transfer between fundamentally different systems results in notable performance degradation and increased voltage magnitude violations. This work is available at: https://github.com/ShuyiGao/GNNs_RL_ESSs and https://github.com/distributionnetworksTUDelft/GNNs_RL_ESSs.

Authors:Yuhang Ma, Jie Wang, Zheng Yan
Title: Are LLM-Enhanced Graph Neural Networks Robust against Poisoning Attacks?
Abstract:
Large Language Models (LLMs) have advanced Graph Neural Networks (GNNs) by enriching node representations with semantic features, giving rise to LLM‑enhanced GNNs that achieve notable performance gains. However, the robustness of these models against poisoning attacks, which manipulate both graph structures and textual attributes during training, remains unexplored. To bridge this gap, we propose a robustness assessment framework that systematically evaluates LLM‑enhanced GNNs under poisoning attacks. Our framework enables comprehensive evaluation across multiple dimensions. Specifically, we assess 24 victim models by combining eight LLM‑ or Language Model (LM)‑based feature enhancers with three representative GNN backbones. To ensure diversity in attack coverage, we incorporate six structural poisoning attacks (both targeted and non‑targeted) and three textual poisoning attacks operating at the character, word, and sentence levels. Furthermore, we employ four real‑world datasets, including one released after the emergence of LLMs, to avoid potential ground truth leakage during LLM pretraining, thereby ensuring fair evaluation. Extensive experiments show that LLM‑enhanced GNNs exhibit significantly higher accuracy and lower Relative Drop in Accuracy (RDA) than a shallow embedding‑based baseline across various attack settings. Our in‑depth analysis identifies key factors that contribute to this robustness, such as the effective encoding of structural and label information in node representations. Based on these insights, we outline future research directions from both offensive and defensive perspectives, and propose a new combined attack along with a graph purification defense. To support future research, we release the source code of our framework at~\urlhttps://github.com/CyberAlSec/LLMEGNNRP.

Authors:Harunori Kawano, Takeshi Sasaki
Title: A Human-Inspired Decoupled Architecture for Efficient Audio Representation Learning
Abstract:
While self‑supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource‑constrained devices. To address this bottleneck, we propose HEAR (Human‑inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M‑94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre‑trained models are available at https://github.com/HarunoriKawano/HEAR

Authors:Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu
Title: QuitoBench: A High-Quality Open Time Series Forecasting Benchmark
Abstract:
Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large‑scale, high‑quality benchmarks. To address this gap, we introduce \textscQuitoBench, a regime‑balanced benchmark for time series forecasting with coverage across eight trend×seasonality×forecastability (TSF) regimes, designed to capture forecasting‑relevant properties rather than application‑defined domain labels. The benchmark is built upon \textscQuito, a billion‑scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context‑length crossover where deep learning models lead at short context (L=96) but foundation models dominate at long context (L \ge 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64 × MAE gap across regimes; (iii) deep learning models match or surpass foundation models at 59 × fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross‑benchmark and cross‑metric consistency. Our open‑source release enables reproducible, regime‑aware evaluation for time series forecasting research.

Authors:Mikalai Korbit, Mario Zanon
Title: Second-Order, First-Class: A Composable Stack for Curvature-Aware Training
Abstract:
Second‑order methods promise improved stability and faster convergence, yet they remain underused due to implementation overhead, tuning brittleness, and the lack of composable APIs. We introduce Somax, a composable Optax‑native stack that treats curvature‑aware training as a single JIT‑compiled step governed by a static plan. Somax exposes first‑class modules ‑‑ curvature operators, estimators, linear solvers, preconditioners, and damping policies ‑‑ behind a single step interface and composes with Optax by applying standard gradient transformations (e.g., momentum, weight decay, schedules) to the computed direction. This design makes typically hidden choices explicit and swappable. Somax separates planning from execution: it derives a static plan (including cadences) from module requirements, then runs the step through a specialized execution path that reuses intermediate results across modules. We report system‑oriented ablations showing that (i) composition choices materially affect scaling behavior and time‑to‑accuracy, and (ii) planning reduces per‑step overhead relative to unplanned composition with redundant recomputation.

Authors:Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li
Title: Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
Abstract:
Human driving behavior is inherently personal, which is shaped by long‑term habits and influenced by short‑term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end‑to‑end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision‑Language‑Action (VLA) driving framework that aligns with users' long‑term driving habits and adapts to real‑time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short‑term guidance. Closed‑loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human‑centered autonomous driving. Our data and code are available at https://dmw‑cvpr.github.io/.

Authors:Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez
Title: No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
Abstract:
Contrastive vision‑language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero‑shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter‑free cross‑modal attention‑pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero‑shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.

Authors:Armand de Villeroché, Rem-Sophia Mouradi, Vincent Le Guen, Sibo Cheng, Marc Bocquet, Alban Farchi, Patrick Armand, Patrick Massin
Title: Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT): a metamodel for 3D atmospheric flow in urban environments
Abstract:
Air flow modeling at a local scale is essential for applications such as pollutant dispersion modeling or wind farm modeling. To circumvent costly Computational Fluid Dynamics (CFD) computations, deep learning surrogate models have recently emerged as promising alternatives. However, in the context of urban air flow, deep learning models struggle to adapt to the high variations of the urban geometry and to large mesh sizes. To tackle these challenges, we introduce Anchored Branched Steady‑state WInd Flow Transformer (AB‑SWIFT), a transformer‑based model with an internal branched structure uniquely designed for atmospheric flow modeling. We train our model on a specially designed database of atmospheric simulations around randomised urban geometries and with a mixture of unstable, neutral, and stable atmospheric stratifications. Our model reaches the best accuracy on all predicted fields compared to state‑of‑the‑art transformers and graph‑based models. Our code and data is available at https://github.com/cerea‑daml/abswift.

Authors:Shuoling Liu, Zhiquan Tan, Kun Yi, Hui Wu, Yihan Li, Jiangpeng Yan, Liyuan Chen, Kai Chen, Qiang Yang
Title: From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents
Abstract:
Although deep research agents (DRAs) have emerged as a promising paradigm for complex information synthesis, their evaluation remains constrained by ad hoc empirical benchmarks. These heuristic approaches do not rigorously model agent behavior or adequately stress‑test long‑horizon synthesis and ambiguity resolution. To bridge this gap, we formalize DRA behavior through the lens of category theory, modeling deep research workflow as a composition of structure‑preserving maps (functors). Grounded in this theoretical framework, we introduce a novel mechanism‑aware benchmark with 296 questions designed to stress‑test agents along four interpretable axes: traversing sequential connectivity chains, verifying intersections within V‑structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. Our rigorous evaluation of 11 leading models establishes a persistently low baseline, with the state‑of‑the‑art achieving only a 19.9% average accuracy, exposing the difficulty of formal structural stress‑testing. Furthermore, our findings reveal a stark dichotomy in the current AI capabilities. While advanced deep research pipelines successfully redefine dynamic topological re‑ordering and exhibit robust ontological verification ‑‑ matching pure reasoning models in falsifying hallucinated premises ‑‑ they almost universally collapse on multi‑hop structural synthesis. Crucially, massive performance variance across tasks exposes a lingering reliance on brittle heuristics rather than a systemic understanding. Ultimately, this work demonstrates that while top‑tier autonomous agents can now organically unify search and reasoning, achieving a generalized mastery over complex structural information remains a formidable open challenge.\footnoteOur implementation will be available at https://github.com/tzq1999/CDR.

Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Title: How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models
Abstract:
Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0‑‑60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features‑‑those with low firing rates‑‑survive pruning far better than frequent ones, with within‑condition Spearman correlations of rho = ‑1.0 in 11 of 17 experimental conditions. This counter‑intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high‑frequency generic features while preserving specialized rare ones. We further show that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, that pre‑trained SAEs remain viable on Wanda‑pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance‑‑a dissociation with implications for interpretability under compression.

Authors:Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu, Chong Wang, Curtis Langlotz
Title: Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models
Abstract:
Out‑of‑distribution (OOD) detection aims to identify samples that deviate from in‑distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose \underlineTest‑time \underlineActivated \underlineNegative \underlineLabels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high‑confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution‑adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine‑grained, batch‑adaptive variant. To fully utilize label activation knowledge, we propose an activation‑aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training‑free, test‑efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large‑scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Codes are available at \hrefhttps://github.com/YBZh/OpenOOD‑VLMYBZh/OpenOOD‑VLM.

Authors:Yinjian Wang, Wei Li, Yuanyuan Gui, James E. Fowler, Gemine Vivone
Title: Robust Principal Component Completion
Abstract:
Robust principal component analysis (RPCA) seeks a low‑rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low‑rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post‑hoc thresholding required of most prior RPCA‑driven approaches. Experimental results reveal that the proposed approach delivers near‑optimal estimates on synthetic data as well as robust foreground‑extraction and anomaly‑detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at https://github.com/WongYinJ/BCP‑RPCC.

Authors:Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
Title: Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
Abstract:
Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post‑training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark‑style evaluations that assume one correct answer, many real‑world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non‑modal answers. This paper describes a multi‑answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference‑time search into the model's generative process. Across question‑answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set‑level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi‑answer RL as a principled and compute‑efficient alternative to inference‑time scaling procedures such as best‑of‑k. Code and more information can be found at https://multi‑answer‑rl.github.io/.

Authors:Yongda Fan, John Wu, Andrea Fitzpatrick, Naveen Baskaran, Jimeng Sun, Adam Cross
Title: A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study
Abstract:
Clinical decisions are high‑stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention when leveraged properly is a highly efficient approach for faithfully interpreting model predictions; (2) black‑box interpreters like KernelSHAP and LIME are computationally infeasible for time‑series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well‑documented open‑source framework: https://github.com/sunlabuiuc/PyHealth.

Authors:Manglam Kartik, Neel Tushar Shah
Title: Light Cones For Vision: Simple Causal Priors For Visual Hierarchy
Abstract:
Standard vision models treat objects as independent points in Euclidean space, unable to capture hierarchical structure like parts within wholes. We introduce Worldline Slot Attention, which models objects as persistent trajectories through spacetime worldlines, where each object has multiple slots at different hierarchy levels sharing the same spatial position but differing in temporal coordinates. This architecture consistently fails without geometric structure: Euclidean worldlines achieve 0.078 level accuracy, below random chance (0.33), while Lorentzian worldlines achieve 0.479‑0.661 across three datasets: a 6x improvement replicated over 20+ independent runs. Lorentzian geometry also outperforms hyperbolic embeddings showing visual hierarchies require causal structure (temporal dependency) rather than tree structure (radial branching). Our results demonstrate that hierarchical object discovery requires geometric structure encoding asymmetric causality, an inductive bias absent from Euclidean space but natural to Lorentzian light cones, achieved with only 11K parameters. The code is available at: https://github.com/iclrsubmissiongram/loco.

Authors:Daniel Benniah John
Title: Decentralized Task Scheduling in Distributed Systems: A Deep Reinforcement Learning Approach
Abstract:
Efficient task scheduling in large‑scale distributed systems presents significant challenges due to dynamic workloads, heterogeneous resources, and competing quality‑of‑service requirements. Traditional centralized approaches face scalability limitations and single points of failure, while classical heuristics lack adaptability to changing conditions. This paper proposes a decentralized multi‑agent deep reinforcement learning (DRL‑MADRL) framework for task scheduling in heterogeneous distributed systems. We formulate the problem as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP) and develop a lightweight actor‑critic architecture implemented using only NumPy, enabling deployment on resource‑constrained edge devices without heavyweight machine learning frameworks. Using workload characteristics derived from the publicly available Google Cluster Trace dataset, we evaluate our approach on a 100‑node heterogeneous system processing 1,000 tasks per episode over 30 experimental runs. Experimental results demonstrate 15.6% improvement in average task completion time (30.8s vs 36.5s for random baseline), 15.2% energy efficiency gain (745.2 kWh vs 878.3 kWh), and 82.3% SLA satisfaction compared to 75.5% for baselines, with all improvements statistically significant (p < 0.001). The lightweight implementation requires only NumPy, Matplotlib, and SciPy. Complete source code and experimental data are provided for full reproducibility at https://github.com/danielbenniah/marl‑distributed‑scheduling.

Authors:Shengli Zhou, Minghang Zheng, Feng Zheng, Yang Liu
Title: Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
Abstract:
Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene‑language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear to the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object‑related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. The code and data are available at https://github.com/oceanflowlab/QuatRoPE.

Authors:Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li
Title: Demystifying When Pruning Works via Representation Hierarchies
Abstract:
Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non‑generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation‑hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre‑softmax outputs), and probability (post‑softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning‑induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical‑token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non‑generative tasks such as retrieval and multiple‑choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE‑Lab‑UMD/Pruning‑on‑Representations

Authors:Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela
Title: Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
Abstract:
The autoresearch repository enables an LLM agent to search for optimal hyperparameter configurations on an unconstrained search space by editing the training code directly. Given a fixed compute budget and constraints, we use autoresearch as a testbed to compare classical hyperparameter optimization (HPO) algorithms against LLM‑based methods on tuning the hyperparameters of a small language model. Within a fixed hyperparameter search space, classical HPO methods such as CMA‑ES and TPE consistently outperform LLM‑based agents. However, an LLM agent that directly edits training source code in an unconstrained search space narrows the gap to classical methods substantially despite using only a self‑hosted open‑weight 27B model. Methods that avoid out‑of‑memory failures outperform those with higher search diversity, suggesting that reliability matters more than exploration breadth. While small and mid‑sized LLMs struggle to track optimization state across trials, classical methods lack domain knowledge. To bridge this gap, we introduce Centaur, a hybrid that shares CMA‑ES's internal state, including mean vector, step‑size, and covariance matrix, with an LLM. Centaur achieves the best result in our experiments, with its 0.8B variant outperforming the 27B variant, suggesting that a cheap LLM suffices when paired with a strong classical optimizer. The 0.8B model is insufficient for unconstrained code editing but sufficient for hybrid optimization, while scaling to 27B provides no advantage for fixed search space methods. Experiments with the frontier model Gemini 3.1 Pro Preview do not close the gap to classical methods. Code is available at https://github.com/ferreirafabio/autoresearch‑automl.

Authors:Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, Jie Jiang
Title: UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
Abstract:
Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long‑horizon GUI tasks. To that end, we propose UI‑Voyager, a novel two‑stage self‑evolving mobile GUI agent. In the first stage, we employ Rejection Fine‑Tuning (RFT), which enables the continuous co‑evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self‑Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step‑level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human‑level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self‑evolving, and high‑performance mobile GUI automation without expensive manual data annotation.

Authors:Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko
Title: Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
Abstract:
LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citeprank2026posttrainbench, novikov2025alphaevolve. We show that an \emphautoresearch‑style pipeline \citepkarpathy2026autoresearch powered by Claude Code discovers novel white‑box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG~\citepzou2023universal, the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT‑OSS‑Safeguard‑20B, compared to \leq10% for existing algorithms (\Creffig:teaser, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held‑out models, achieving 100% ASR against Meta‑SecAlign‑70B \citepchen2025secalign versus 56% for the best baseline (\Creffig:teaser, middle). Extending the findings of~\citecarlini2025autoadvexbench, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White‑box adversarial red‑teaming is particularly well‑suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.

Authors:Arsen Kuzhamuratov, Mikhail Zhirnov, Andrey Kuznetsov, Ivan Oseledets, Konstantin Sobolev
Title: Marchuk: Efficient Global Weather Forecasting from Mid-Range to Sub-Seasonal Scales via Flow Matching
Abstract:
Accurate subseasonal weather forecasting remains a major challenge due to the inherently chaotic nature of the atmosphere, which limits the predictive skill of conventional models beyond the mid‑range horizon (approximately 15 days). In this work, we present Marchuk, a generative latent flow‑matching model for global weather forecasting spanning mid‑range to subseasonal timescales, with prediction horizons of up to 30 days. Marchuk conditions on current‑day weather maps and autoregressively predicts subsequent days' weather maps within the learned latent space. We replace rotary positional encodings (RoPE) with trainable positional embeddings and extend the temporal context window, which together enhance the model's ability to represent and propagate long‑range temporal dependencies during latent forecasting. Marchuk offers two key advantages: high computational efficiency and strong predictive performance. Despite its compact architecture of only 276 million parameters, the model achieves performance comparable to LaDCast, a substantially larger model with 1.6 billion parameters, while operating at significantly higher inference speeds. We open‑source our inference code and model at: https://v‑gen‑ai.github.io/Marchuk/

Authors:Yifeng Zhang, Harsh Goel, Peizhuo Li, Mehul Damani, Sandeep Chinchali, Guillaume Sartoretti
Title: CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control
Abstract:
Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever‑expanding cities. Multi‑Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL‑based framework designed to improve intra‑neighborhood traffic by enhancing decision‑making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network‑level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents' capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor‑aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision‑making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state‑of‑the‑art traffic signal control methods over three real‑world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at https://github.com/marmotlab/CoordLight

Authors:Eyal Weiss
Title: Cost-Sensitive Neighborhood Aggregation for Heterophilous Graphs: When Does Per-Edge Routing Help?
Abstract:
Recent work distinguishes two heterophily regimes: adversarial, where cross‑class edges dilute class signal and harm classification, and informative, where the heterophilous structure itself carries useful signal. We ask: when does per‑edge message routing help, and when is a uniform spectral channel sufficient? To operationalize this question we introduce Cost‑Sensitive Neighborhood Aggregation (CSNA), a GNN layer that computes pairwise distance in a learned projection and uses it to soft‑route each message through concordant and discordant channels with independent transformations. Under a contextual stochastic block model we show that mean aggregation can reverse the label‑aligned signal direction under heterophily, and that cost‑sensitive weighting with w_+/w_‑ > q/p preserves the correct sign. On six benchmarks with uniform tuning, CSNA is competitive with state‑of‑the‑art methods on adversarial‑heterophily datasets (Texas, Wisconsin, Cornell, Actor) but underperforms on informative‑heterophily datasets (Chameleon, Squirrel) ‑‑ precisely the regime where per‑edge routing has no useful decomposition to exploit. The pattern is itself the finding: the cost function's ability to separate edge types serves as a diagnostic for the heterophily regime, revealing when fine‑grained routing adds value over uniform channels and when it does not. Code is available at https://github.com/eyal‑weiss/CSNA‑public .

Authors:Minjun Kim, Minje Kim
Title: HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer
Abstract:
Personalized Federated Learning (PFL) aims to deliver effective client‑specific models under heterogeneous distributions, yet existing methods suffer from shallow prototype alignment and brittle server‑side distillation. We propose HEART‑PFL, a dual‑sided framework that (i) performs depth‑aware Hierarchical Directional Alignment (HDA) using cosine similarity in the early stage and MSE matching in the deep stage to preserve client specificity, and (ii) stabilizes global updates through Adversarial Knowledge Transfer (AKT) with symmetric KL distillation on clean and adversarial proxy data. Using lightweight adapters with only 1.46M trainable parameters, HEART‑PFL achieves state‑of‑the‑art personalized accuracy on CIFAR‑100, Flowers‑102, and Caltech‑101 (63.42%, 84.23%, and 95.67%, respectively) under Dirichlet non‑IID partitions, and remains robust to out‑of‑domain proxy data. Ablation studies further confirm that HDA and AKT provide complementary gains in alignment, robustness, and optimization stability, offering insights into how the two components mutually reinforce effective personalization. Overall, these results demonstrate that HEART‑PFL simultaneously enhances personalization and global stability, highlighting its potential as a strong and scalable solution for PFL(code available at https://github.com/danny0628/HEART‑PFL).

Authors:Rami Luisto
Title: A visual observation on the geometry of UMAP projections of the difference vectors of antonym and synonym word pair embeddings
Abstract:
Antonyms, or opposites, are sometimes defined as \emphword pairs that have all of the same contextually relevant properties but one. Seeing how transformer models seem to encode concepts as directions, this begs the question if one can detect ``antonymity'' in the geometry of the embedding vectors of word pairs, especially based on their difference vectors. Such geometrical studies are then naturally contrasted by comparing antonymic pairs to their opposites; synonyms. This paper started as an exploratory project on the complexity of the systems needed to detect the geometry of the embedding vectors of antonymic word pairs. What we now report is a curious ``swirl'' that appears across embedding models in a somewhat specific projection configuration.

Authors:Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye
Title: Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
Abstract:
Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor‑Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a ``Tutor'' agent learns to guide a ``Student'' (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0‑1) to the sample's loss, thereby dynamically re‑weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high‑value samples, such as hard‑but‑learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.

Authors:Mayssa Soussia, Gita Ayu Salsabila, Mohamed Ali Mahjoub, Islem Rekik
Title: Reservoir-Based Graph Convolutional Networks
Abstract:
Message passing is a core mechanism in Graph Neural Networks (GNNs), enabling the iterative update of node embeddings by aggregating information from neighboring nodes. Graph Convolutional Networks (GCNs) exemplify this approach by adapting convolutional operations for graph structures, allowing features from adjacent nodes to be combined effectively. However, GCNs encounter challenges with complex or dynamic data. Capturing long‑range dependencies often requires deeper layers, which not only increase computational costs but also lead to over‑smoothing, where node embeddings become indistinguishable. To overcome these challenges, reservoir computing has been integrated into GNNs, leveraging iterative message‑passing dynamics for stable information propagation without extensive parameter tuning. Despite its promise, existing reservoir‑based models lack structured convolutional mechanisms, limiting their ability to accurately aggregate multi‑hop neighborhood information. To address these limitations, we propose RGC‑Net (Reservoir‑based Graph Convolutional Network), which integrates reservoir dynamics with structured graph convolution. Key contributions include: (i) a reimagined convolutional framework with fixed random reservoir weights and a leaky integrator to enhance feature retention; (ii) a robust, adaptable model for graph classification; and (iii) an RGC‑Net‑powered transformer for graph generation with application to dynamic brain connectivity. Extensive experiments show that RGC‑Net achieves state‑of‑the‑art performance in classification and generative tasks, including brain graph evolution, with faster convergence and reduced over‑smoothing. Source code is available at https://github.com/basiralab/RGC‑Net .

Authors:Mingyi Liu
Title: The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
Abstract:
RLHF‑aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40‑79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling‑based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task‑dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base‑vs‑instruct ablation confirms the causal role of alignment: the base model shows 1.0% single‑cluster rate vs. 28.5% for the instruct model (p < 10^‑6). A training stage ablation (Base 0.0% ‑> SFT 1.5% ‑> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross‑family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B‑14B), with Jaccard, embedding, and NLI‑based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross‑embedder validation with two independent embedding families rules out coupling bias. Cross‑dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding ‑‑ response homogenization ‑‑ is implementation‑independent and label‑free. Motivated by this diagnosis, we explore a cheapest‑first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| <= 0.12) enable 57% cost savings.

Authors:Anjun Gao, Zhenglin Wan, Pingfu Chao, Shunyu Yao
Title: Hierarchical Spatial-Temporal Graph-Enhanced Model for Map-Matching
Abstract:
The integration of GNSS data into portable devices has led to the generation of vast amounts of trajectory data, which is crucial for applications such as map‑matching. To tackle the limitations of rule‑based methods, recent works in deep learning for trajectory‑related tasks occur. However, existing models remain challenging due to issues such as the difficulty of large‑scale data labeling, ineffective modeling of spatial‑temporal relationships, and discrepancies between training and test data distributions. To tackle these challenges, we propose HSTGMatch, a novel model designed to enhance map‑matching performance. Our approach involves a two‑stage process: hierarchical self‑supervised learning and spatial‑temporal supervised learning. We introduce a hierarchical trajectory representation, leveraging both grid cells and geographic tuples to capture moving patterns effectively. The model constructs an Adaptive Trajectory Adjacency Graph to dynamically capture spatial relationships, optimizing GATs for improved efficiency. Furthermore, we incorporate a Spatial‑Temporal Factor to extract relevant features and employ a decay coefficient to address variations in trajectory length. Our extensive experiments demonstrate the model's superior performance, module effectiveness, and robustness, providing a promising solution for overcoming the existing limitations in map‑matching applications. The source code of HSTGMatch is publicly available on GitHub at https://github.com/Nerooo‑g/HSTGMatch.

Authors:Jimyung Hong, Jaehyung Kim
Title: Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions or layers, yet existing methods face critical trade‑offs: task‑agnostic approaches cannot adapt to task‑specific requirements, while task‑aware methods require costly training to learn task adaptability. We propose DIET (Dimension‑wise global pruning of LLMs via merging Task‑wise importance scores), a training‑free structured pruning method that combines dimension‑level granularity with task‑aware selection. DIET profiles activation magnitudes across tasks using only 100 samples per task, then applies majority voting to construct a single global mask. DIET does not require large costs from pre‑computation or training. Experiments on seven zero‑shot benchmarks using Gemma‑2 2B and 9B models demonstrate the effectiveness of DIET; for example, at 20% sparsity on Gemma‑2 2B, DIET achieves near 10% average accuracy improvement, compared to previous state‑of‑the‑art structured pruning methods. This advantage persists across various sparsity levels and model scales, positioning DIET as a practical and robust choice for structured LLM pruning.

Authors:Forest Agostinelli
Title: The DeepXube Software Package for Solving Pathfinding Problems with Learned Heuristic Functions and Search
Abstract:
DeepXube is a free and open‑source Python package and command‑line tool that seeks to automate the solution of pathfinding problems by using machine learning to learn heuristic functions that guide heuristic search algorithms tailored to deep neural networks (DNNs). DeepXube is comprised of the latest advances in deep reinforcement learning, heuristic search, and formal logic for solving pathfinding problems. This includes limited‑horizon Bellman‑based learning, hindsight experience replay, batched heuristic search, and specifying goals with answer‑set programming. A robust multiple‑inheritance structure simplifies the definition of pathfinding domains and the generation of training data. Training heuristic functions is made efficient through the automatic parallelization of the generation of training data across central processing units (CPUs) and reinforcement learning updates across graphics processing units (GPUs). Pathfinding algorithms that take advantage of the parallelism of GPUs and DNN architectures, such as batch weighted A and Q search and beam search are easily employed to solve pathfinding problems through command‑line arguments. Finally, several convenient features for visualization, code profiling, and progress monitoring during training and solving are available. The GitHub repository is publicly available at https://github.com/forestagostinelli/deepxube.

Authors:Akshay Rangamani, Altay Unal
Title: Deep Neural Regression Collapse
Abstract:
Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets and explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.

Authors:Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis
Title: Sparse Autoencoders for Interpretable Medical Image Representation Learning
Abstract:
Vision foundation models (FMs) achieve state‑of‑the‑art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human‑interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general‑purpose) using 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset. We find that learned sparse features: (a) reconstruct original embeddings with high fidelity (R2 up to 0.941) and recover up to 87.8% of downstream performance using only 10 features (99.4% dimensionality reduction), (b) preserve semantic fidelity in image retrieval tasks, (c) correspond to specific concepts that can be expressed in language using large language model (LLM)‑based auto‑interpretation. (d) bridge clinical language and abstract latent representations in zero‑shot language‑driven image retrieval. Our work indicates SAEs are a promising pathway towards interpretable, concept‑driven medical vision systems. Code repository: https://github.com/pwesp/sail.

Authors:Igor Jankowski
Title: Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL
Abstract:
While Multi‑Agent Reinforcement Learning (MARL) algorithms achieve unprecedented successes across complex continuous domains, their standard deployment strictly adheres to a synchronous operational paradigm. Under this paradigm, agents are universally forced to execute deep neural network inferences at every micro‑frame, regardless of immediate necessity. This dense throughput acts as a fundamental barrier to physical deployment on edge‑devices where thermal and metabolic budgets are highly constrained. We propose Epistemic Time‑Dilation MAPPO (ETD‑MAPPO), augmented with a Dual‑Gated Epistemic Trigger. Instead of depending on rigid frame‑skipping (macro‑actions), agents autonomously modulate their execution frequency by interpreting aleatoric uncertainty (via Shannon entropy of their policy) and epistemic uncertainty (via state‑value divergence in a Twin‑Critic architecture). To format this, we structure the environment as a Semi‑Markov Decision Process (SMDP) and build the SMDP‑Aligned Asynchronous Gradient Masking Critic to ensure proper credit assignment. Empirical findings demonstrate massive improvements (> 60% relative baseline acquisition leaps) over current temporal models. By assessing LBF, MPE, and the 115‑dimensional state space of Google Research Football (GRF), ETD correctly prevented premature policy collapse. Remarkably, this unconstrained approach leads to emergent Temporal Role Specialization, reducing computational overhead by a statistically dominant 73.6% entirely during off‑ball execution without deteriorating centralized task dominance.

Authors:Jannik Endres, Etienne Laliberté, David Rolnick, Arthur Ouaknine
Title: Estimating Individual Tree Height and Species from UAV Imagery
Abstract:
Accurate estimation of forest biomass, a major carbon sink, relies heavily on tree‑level traits such as height and species. Unoccupied Aerial Vehicles (UAVs) capturing high‑resolution imagery from a single RGB camera offer a cost‑effective and scalable approach for mapping and measuring individual trees. We introduce BIRCH‑Trees, the first benchmark for individual tree height and species estimation from tree‑centered UAV images, spanning three datasets: temperate forests, tropical forests, and boreal plantations. We also present DINOvTree, a unified approach using a Vision Foundation Model (VFM) backbone with task‑specific heads for simultaneous height and species prediction. Through extensive evaluations on BIRCH‑Trees, we compare DINOvTree against commonly used vision methods, including VFMs, as well as biological allometric equations. We find that DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54% to 58% of the parameters of the second‑best approach.

Authors:Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi
Title: AI Generalisation Gap In Comorbid Sleep Disorder Staging
Abstract:
Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor‑intensive, and manually scored. While deep learning enables automated EEG‑based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad‑CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE‑ResNet plus bidirectional LSTM model for single‑channel EEG sleep staging. As expected, cross‑domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject‑aware or disease‑specific models with clinical validation before deployment. A summary of the paper and the code is available at https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

Authors:Wei Sun, Ting Wang, Xinran Tian, Wanshun Lan, Xuhan Feng, Haoyue Li, Fangxin Wang
Title: MetaKube: An Experience-Aware LLM Framework for Kubernetes Failure Diagnosis
Abstract:
Existing LLM‑based Kubernetes diagnostic systems cannot learn from operational experience, operating on static knowledge bases without improving from past resolutions. We present MetaKube, an experience‑aware LLM framework through three synergistic innovations: (1) an Episodic Pattern Memory Network (EPMN) that abstracts diagnostic patterns from historical resolutions and provides confidence‑calibrated retrieval for both rapid pattern matching and guided causal exploration, (2) a meta‑cognitive controller that dynamically routes between intuitive and analytical pathways based on problem familiarity, optimizing the trade‑off between speed and depth, and (3) KubeLLM, a locally‑deployable 8B model enhanced through domain‑specific post‑training on our 7,000‑sample Kubernetes Fault Resolution Dataset. Evaluation on 1,873 real‑world scenarios demonstrates MetaKube transforms Qwen3‑8B from 50.9 to 90.5 points, approaching GPT‑4.1 performance while ensuring complete data privacy. EPMN contributes 15.3% improvement through experiential learning, with continuous learning experiments showing progressive gains as the system accumulates operational knowledge. The source code and related resources are available at https://github.com/MetaKube‑LLM‑for‑Kubernetes‑Diagnosis/MetaKube.

Authors:Haoyu Wang, Yuxin Chen, Liang Luo, Buyun Zhang, Ellie Dingqiao Wen, Pan Li
Title: Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction
Abstract:
Multi‑turn human‑AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn‑wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine‑grained, turn‑wise process rewards from sparse outcome signals. Unlike volatile token‑level rewards, these turn‑level signals exhibit superior robustness and may utilize a normalization mechanism to further enhance training stability. We evaluate ITPO across three representative multi‑turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves improved convergence than existing baselines. Elaborate trajectory analysis confirms that ITPO infers turn‑wise preferences that are semantically aligned with human judgment. Code is publicly available at https://github.com/Graph‑COM/ITPO.

Authors:Bhavik Mangla
Title: MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG
Abstract:
RAG pipelines typically rely on fixed‑size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three‑stage pipeline for Markdown documents that (1) performs structure‑aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document‑level context; and (3) restructures chunks by merging those sharing the same semantic key via bin‑packing, co‑locating related content for retrieval. The single‑call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per‑field extraction passes. Rolling key propagation replaces hand‑tuned scoring with LLM‑native semantic matching. An empirical evaluation on 30 queries over an 18‑document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI‑compatible endpoint.

Authors:Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou
Title: VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
Abstract:
Video‑Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long‑horizon tasks through visual reasoning, they remain limited in contact‑rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine‑grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video‑Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross‑modal representation learning without tactile‑language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross‑modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact‑rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick‑and‑place requiring high‑fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.

Authors:Ezgi Ozyilkan, Zhiqi Chen, Oren Rippel, Jona Ballé, Kedar Tatwawadi
Title: Drop-In Perceptual Optimization for 3D Gaussian Splatting
Abstract:
Despite their output being ultimately consumed by human viewers, 3D Gaussian Splatting (3DGS) methods often rely on ad‑hoc combinations of pixel‑level losses, resulting in blurry renderings. To address this, we systematically explore perceptual optimization strategies for 3DGS by searching over a diverse set of distortion losses. We conduct the first‑of‑its‑kind large‑scale human subjective study on 3DGS, involving 39,320 pairwise ratings across several datasets and 3DGS frameworks. A regularized version of Wasserstein Distortion, which we call WD‑R, emerges as the clear winner, excelling at recovering fine textures without incurring a higher splat count. WD‑R is preferred by raters more than 2.3× over the original 3DGS loss, and 1.5× over current best method Perceptual‑GS. WD‑R also consistently achieves state‑of‑the‑art LPIPS, DISTS, and FID scores across various datasets, and generalizes across recent frameworks, such as Mip‑Splatting and Scaffold‑GS, where replacing the original loss with WD‑R consistently enhances perceptual quality within a similar resource budget (number of splats for Mip‑Splatting, model size for Scaffold‑GS), and leads to reconstructions being preferred by human raters 1.8× and 3.6×, respectively. We also find that this carries over to the task of 3DGS scene compression, with \approx 50% bitrate savings for comparable perceptual metric performance.

Authors:Chao Han, Stefanos Ioannou, Luca Manneschi, T. J. Hayward, Michael Mangan, Aditya Gilra, Eleni Vasilaki
Title: Neural ODE and SDE Models for Adaptation and Planning in Model-Based Reinforcement Learning
Abstract:
We investigate neural ordinary and stochastic differential equations (neural ODEs and SDEs) to model stochastic dynamics in fully and partially observed environments within a model‑based reinforcement learning (RL) framework. Through a sequence of simulations, we show that neural SDEs more effectively capture the inherent stochasticity of transition dynamics, enabling high‑performing policies with improved sample efficiency in challenging scenarios. We leverage neural ODEs and SDEs for efficient policy adaptation to changes in environment dynamics via inverse models, requiring only limited interactions with the new environment. To address partial observability, we introduce a latent SDE model that combines an ODE with a GAN‑trained stochastic component in latent space. Policies derived from this model provide a strong baseline, outperforming or matching general model‑based and model‑free approaches across stochastic continuous‑control benchmarks. This work demonstrates the applicability of action‑conditional latent SDEs for RL planning in environments with stochastic transitions. Our code is available at: https://github.com/ChaoHan‑UoS/NeuralRL

Authors:Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones
Title: Sparser, Faster, Lighter Transformer Language Models
Abstract:
Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open‑source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.

Authors:Yuanhang Lei, Tao Cheng, Xingxuan Li, Boming Zhao, Siyuan Huang, Ruizhen Hu, Peter Yichen Chen, Hujun Bao, Zhaopeng Cui
Title: PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
Abstract:
Achieving real‑time physics‑based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics‑informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full‑space deformation, with subspace defined by handle transformations. To generate mesh‑free, discretization‑agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer‑based encoder and a cross‑attention decoder. Furthermore, we also develop a novel physics‑informed self‑supervised learning strategy that incorporates on‑the‑fly skinning‑field normalization and conflict‑aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints. PhysSkin shows outstanding performance on generalizable neural skinning and enables real‑time physics‑based animation.

Authors:Louis Claeys, Artur Goldman, Zebang Shen, Niao He
Title: A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control
Abstract:
High‑dimensional stochastic optimal control (SOC) becomes harder with longer planning horizons: existing methods scale linearly in the horizon T, with performance often deteriorating exponentially. We overcome these limitations for a subclass of linearly‑solvable SOC problems‑those whose uncontrolled drift is the gradient of a potential. In this setting, the Hamilton‑Jacobi‑Bellman equation reduces to a linear PDE governed by an operator \mathcalL. We prove that, under the gradient drift assumption, \mathcalL is unitarily equivalent to a Schrödinger operator \mathcalS = ‑Δ+ \mathcalV with purely discrete spectrum, allowing the long‑horizon control to be efficiently described via the eigensystem of \mathcalL. This connection provides two key results: first, for a symmetric linear‑quadratic regulator (LQR), \mathcalS matches the Hamiltonian of a quantum harmonic oscillator, whose closed‑form eigensystem yields an analytic solution to the symmetric LQR with \empharbitrary terminal cost. Second, in a more general setting, we learn the eigensystem of \mathcalL using neural networks. We identify implicit reweighting issues with existing eigenfunction learning losses that degrade performance in control tasks, and propose a novel loss function to mitigate this. We evaluate our method on several long‑horizon benchmarks, achieving an order‑of‑magnitude improvement in control accuracy compared to state‑of‑the‑art methods, while reducing memory usage and runtime complexity from \mathcalO(Td) to \mathcalO(d).

Authors:Donya Jafari, Farzan Farnia
Title: DAK-UCB: Diversity-Aware Prompt Routing for LLMs and Generative Models
Abstract:
The expansion of generative AI and LLM services underscores the growing need for adaptive mechanisms to select an appropriate available model to respond to a user's prompts. Recent works have proposed offline and online learning formulations to identify the optimal generative AI model for an input prompt, based solely on maximizing prompt‑based fidelity evaluation scores, e.g., CLIP‑Score in text‑to‑image generation. However, such fidelity‑based selection methods overlook the diversity of generated outputs, and hence, they can fail to address potential diversity shortcomings in the generated responses. In this paper, we introduce the Diversity‑Aware Kernelized Upper Confidence Bound (DAK‑UCB) method as a contextual bandit algorithm for the online selection of generative models with diversity considerations. The proposed DAK‑UCB method incorporates both fidelity and diversity‑related metrics into the selection process. We design this framework based on prompt‑aware diversity score functions that decompose to a two‑sample‑based expectation over prompt‑output pairs in the previous generation rounds. Specifically, we illustrate the application of our framework using joint kernel distance and kernel entropy measures. Our experimental results demonstrate the effectiveness of DAK‑UCB in promoting diversity‑aware model selection while maintaining fidelity in the generations for a sequence of prompts. The code is available at https://github.com/Donya‑Jafari/DAK‑UCB.

Authors:Devvrat Joshi, Islem Rekik
Title: HGNet: Scalable Foundation Model for Automated Knowledge Graph Generation from Scientific Literature
Abstract:
Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi‑word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general‑purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two‑stage framework for scalable, zero‑shot scientific KG construction. The first stage, Z‑NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain‑agnostic entity recognition by isolating semantic "turns" in text, and (ii) a Multi‑Scale TCQK attention mechanism that captures coherent multi‑word entities through n‑gram‑aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy‑aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods. We release SPHERE (https://github.com/basiralab/SPHERE), a multi‑domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out‑of‑distribution tests. In zero‑shot settings, gains reach 10.76% for NER and 26.2% for RE.

Authors:Davide Scassola, Dylan Ponsford, Adrián Javaloy, Sebastiano Saccani, Luca Bortolussi, Henry Gouk, Antonio Vergari
Title: A Sobering Look at Tabular Data Generation via Probabilistic Circuits
Abstract:
Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion‑based models are the current state‑of‑the‑art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline ‑‑ hierarchical mixture models in the form of deep probabilistic circuits (PCs) ‑‑ which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference. Finally, in a rigorous empirical analysis we show that the apparent saturation of progress for SotA models is largely due to the use of inadequate metrics. As such, we highlight that there is still much to be done to generate realistic tabular data. Code available at https://github.com/april‑tools/tabpc.

Authors:Shengping Xie, Zekun Wu, Quan Chen, Kaixu Tang
Title: Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints
Abstract:
Implicit bias induced by gradient‑based algorithms is essential to the generalization of overparameterized models, yet its mechanisms can be subtle. This work leverages the Normalized Steepest Descent (NSD) framework to investigate how optimization geometry shapes solutions on multiclass separable data. We introduce NucGD, a geometry‑aware optimizer designed to enforce low rank structures through nuclear norm constraints. Beyond the algorithm itself, we connect NucGD with emerging low‑rank projection methods, providing a unified perspective. To enable scalable training, we derive an efficient SVD‑free update rule via asynchronous power iteration. Furthermore, we empirically dissect the impact of stochastic optimization dynamics, characterizing how varying levels of gradient noise induced by mini‑batch sampling and momentum modulate the convergence toward the expected maximum margin solutions.Our code is accessible at: https://github.com/Tsokarsic/observing‑the‑implicit‑bias‑on‑multiclass‑seperable‑data.

Authors:Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Title: Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures
Abstract:
Hybrid language models combining attention with state space models (SSMs) or linear attention offer improved efficiency, but whether both components are genuinely utilized remains unclear. We present a functional component ablation framework applied to two sub‑1B hybrid models ‑‑ Qwen3.5‑0.8B (sequential: Gated DeltaNet + softmax attention) and Falcon‑H1‑0.5B (parallel: Mamba‑2 + attention) ‑‑ with a pure Transformer control (Qwen2.5‑0.5B). Through group ablations, layer‑wise sweeps, positional ablations, matched random controls, and perplexity analysis across five benchmarks, we establish four findings: (1) both component types are essential and neither is bypassed; (2) the alternative component (linear attention or SSM) is the primary language modeling backbone, causing >35,000x perplexity degradation when removed versus ~82x for attention; (3) component importance follows a positional gradient, with early layers being disproportionately critical; and (4) hybrid architectures exhibit 20‑119x greater resilience to random layer removal than pure Transformers, revealing built‑in functional redundancy between component types. These results provide actionable guidance for hybrid model compression, architecture design, and fault‑tolerant deployment.

Authors:Xingyu Chen, Junxiu An, Jun Guo, Yuqian Zhou
Title: Symbolic Graph Networks for Robust PDE Discovery from Noisy Sparse Data
Abstract:
Data‑driven discovery of partial differential equations (PDEs) offers a promising paradigm for uncovering governing physical laws from observational data. However, in practical scenarios, measurements are often contaminated by noise and limited by sparse sampling, which poses significant challenges to existing approaches based on numerical differentiation or integral formulations. In this work, we propose a Symbolic Graph Network (SGN) framework for PDE discovery under noisy and sparse conditions. Instead of relying on local differential approximations, SGN leverages graph message passing to model spatial interactions, providing a non‑local representation that is less sensitive to high frequency noise. Based on this representation, the learned latent features are further processed by a symbolic regression module to extract interpretable mathematical expressions. We evaluate the proposed method on several benchmark systems, including the wave equation, convection‑diffusion equation, and incompressible Navier‑Stokes equations. Experimental results show that SGN can recover meaningful governing relations or solution forms under varying noise levels, and demonstrates improved robustness compared to baseline methods in sparse and noisy settings. These results suggest that combining graph‑based representations with symbolic regression provides a viable direction for robust data‑driven discovery of physical laws from imperfect observations. The code is available at https://github.com/CXY0112/SGN

Authors:Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Title: Rethinking Multimodal Fusion for Time Series: Auxiliary Modalities Need Constrained Fusion
Abstract:
Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture‑specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities which may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug‑in method that enables controlled cross‑modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low‑rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods including CFA. Code is publicly available at: https://github.com/seunghan96/cfa/.

Authors:Fangyuan Li, Pengfei Li, Shijie Wang, Junqi Gao, Jianxing Liu, Biqing Qi, Yuqiang Li
Title: WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement
Abstract:
Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self‑improvement of language models, but existing methods face a key trade‑off: endogenous self‑play can drift over iterations, while corpus‑grounded approaches rely on curated data environments. We present WIST, a Web‑grounded Iterative Self‑play Tree framework for domain‑targeted reasoning improvement that learns directly from the open web without requiring any pre‑arranged domain corpus. WIST incrementally expands a domain tree for exploration, and retrieves and cleans path‑consistent web corpus to construct a controllable training environment. It then performs Challenger‑‑Solver self‑play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self‑evolution and corpus‑grounded self‑play baselines, with the Overall gains reaching +9.8 (Qwen3‑4B‑Base) and +9.7 (OctoThinker‑8B). WIST is also domain‑steerable, improving Qwen3‑8B‑Base by +14.79 in medicine and Qwen3‑4B‑Base by +5.28 on PhyBench. Ablations further confirm the importance of WIST's key components for stable open‑web learning. Our Code is available at https://github.com/lfy‑123/WIST.

Authors:Drake Caraker, Bryan Arnold, David Rhoads
Title: First-Mover Bias in Gradient Boosting Explanations: Mechanism, Detection, and Resolution
Abstract:
We isolate and empirically characterize first‑mover bias ‑‑ a path‑dependent concentration of feature importance caused by sequential residual fitting in gradient boosting ‑‑ as a specific mechanistic cause of the well‑known instability of SHAP‑based feature rankings under multicollinearity. When correlated features compete for early splits, gradient boosting creates a self‑reinforcing advantage for whichever feature is selected first: subsequent trees inherit modified residuals that favor the incumbent, concentrating SHAP importance on an arbitrary feature rather than distributing it across the correlated group. Scaling up a single model amplifies this effect ‑‑ a Large Single Model with the same total tree count as our method produces the worst explanations of any approach tested. We demonstrate that model independence is sufficient to resolve first‑mover bias in the linear regime, and remains the most effective mitigation under nonlinear data‑generating processes. Both our proposed method, DASH (Diversified Aggregation of SHAP), and simple seed‑averaging (Stochastic Retrain) restore stability by breaking the sequential dependency chain, confirming that the operative mechanism is independence between explained models. At rho=0.9, both achieve stability=0.977, while the single‑best workflow degrades to 0.958 and the Large Single Model to 0.938. On the Breast Cancer dataset, DASH improves stability from 0.32 to 0.93 (+0.61) against a tree‑count‑matched baseline. DASH additionally provides two diagnostic tools ‑‑ the Feature Stability Index (FSI) and Importance‑Stability (IS) Plot ‑‑ that detect first‑mover bias without ground truth, enabling practitioners to audit explanation reliability before acting on feature rankings. Software and reproducible benchmarks are available at https://github.com/DrakeCaraker/dash‑shap.

Authors:Eric Czech, Zhiwei Xu, Yael Elmatad, Yixin Wang, William Held
Title: Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits
Abstract:
Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute‑optimal allocation estimates, even on noise‑free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the 3.8×10^25 FLOP training budget and \1.4M (90% CI: \412K‑\2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry (α\neq β). Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data‑efficient, numerically unstable, prone to local minima, and harder to implement. Each concern is shown to be unfounded or addressable, especially when the partially linear structure of the objective is exploited via Variable Projection, enabling unbiased inference on all five loss surface parameters through a two‑dimensional optimization that is well‑conditioned, analytically differentiable, and amenable to dense, or even exhaustive, grid search. It may serve as a more convenient replacement for Approach 2 or a more scalable alternative for adaptations of Approach 3 to richer scaling law formulations. See https://github.com/Open‑Athena/vpnls for details and https://openathena.ai/scaling‑law‑analysis for other results from this study.

Authors:Chenhan Wang, Zhengyi Bao, Huipin Lin, Jiahao Nie, Chunxiang Zhu
Title: A Multi-Task Targeted Learning Framework for Lithium-Ion Battery State-of-Health and Remaining Useful Life
Abstract:
Accurately predicting the state‑of‑health (SOH) and remaining useful life (RUL) of lithium‑ion batteries is crucial for ensuring the safe and efficient operation of electric vehicles while minimizing associated risks. However, current deep learning methods are limited in their ability to selectively extract features and model time dependencies for these two parameters. Moreover, most existing methods rely on traditional recurrent neural networks, which have inherent shortcomings in long‑term time‑series modeling. To address these issues, this paper proposes a multi‑task targeted learning framework for SOH and RUL prediction, which integrates multiple neural networks, including a multi‑scale feature extraction module, an improved extended LSTM, and a dual‑stream attention module. First, a feature extraction module with multi‑scale CNNs is designed to capture detailed local battery decline patterns. Secondly, an improved extended LSTM network is employed to enhance the model's ability to retain long‑term temporal information, thus improving temporal relationship modeling. Building on this, the dual‑stream attention module‑comprising polarized attention and sparse attention to selectively focus on key information relevant to SOH and RUL, respectively, by assigning higher weights to important features. Finally, a many‑to‑two mapping is achieved through the dual‑task layer. To optimize the model's performance and reduce the need for manual hyperparameter tuning, the Hyperopt optimization algorithm is used. Extensive comparative experiments on battery aging datasets demonstrate that the proposed method reduces the average RMSE for SOH and RUL predictions by 111.3% and 33.0%, respectively, compared to traditional and state‑of‑the‑art methods.

Authors:Peisong Niu, Haifan Zhang, Yang Zhao, Tian Zhou, Ziqing Ma, Wenqiang Shen, Junping Zhao, Huiling Yuan, Liang Sun
Title: Enhancing AI-Based Tropical Cyclone Track and Intensity Forecasting via Systematic Bias Correction
Abstract:
Tropical cyclones (TCs) pose severe threats to life, infrastructure, and economies in tropical and subtropical regions, underscoring the critical need for accurate and timely forecasts of both track and intensity. Recent advances in AI‑based weather forecasting have shown promise in improving TC track forecasts. However, these systems are typically trained on coarse‑resolution reanalysis data (e.g., ERA5 at 0.25 degree), which constrains predicted TC positions to a fixed grid and introduces significant discretization errors. Moreover, intensity forecasting remains limited especially for strong TCs by the smoothing effect of coarse meteorological fields and the use of regression losses that bias predictions toward conditional means. To address these limitations, we propose BaguanCyclone, a novel, unified framework that integrates two key innovations: (1) a probabilistic center refinement module that models the continuous spatial distribution of TC centers, enabling finer track precision; and (2) a region‑aware intensity forecasting module that leverages high‑resolution internal representations within dynamically defined sub‑grid zones around the TC core to better capture localized extremes. Evaluated on the global IBTrACS dataset across six major TC basins, our system consistently outperforms both operational numerical weather prediction (NWP) models and most AI‑based baselines, delivering a substantial enhancement in forecast accuracy. Remarkably, BaguanCyclone excels in navigating meteorological complexities, consistently delivering accurate forecasts for re‑intensification, sweeping arcs, twin cyclones, and meandering events. Our code is available at https://github.com/DAMO‑DI‑ML/Baguan‑cyclone.

Authors:Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang
Title: Scaling Attention via Feature Sparsity
Abstract:
Scaling Transformers to ultra‑long contexts is bottlenecked by the O(n^2 d) cost of self‑attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token‑level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as k‑sparse codes that preserve high‑dimensional expressivity while reducing the cost of attention from Θ(n^2 d) to Θ(n^2 k^2/d). To make this efficient at scale, we introduce FlashSFA, an IO‑aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT‑2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to 2.5× and reducing FLOPs and KV‑cache by nearly 50%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short‑embedding baselines that collapse feature diversity. These results establish feature‑level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders‑of‑magnitude longer contexts with minimal quality loss. Code is available at https://github.com/YannX1e/Sparse‑Feature‑Attention.

Authors:Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang
Title: TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs
Abstract:
Search‑augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open‑domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn‑Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn‑level rewards to each reasoning + tool‑call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging the potential‑based reward shaping, TIPS offers fine‑grained and policy‑invariant guidance that overcomes the limitations of outcome‑only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen‑2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn‑level information‑potential reward shaping provides an effective and general solution to sparse‑reward credit assignment for multi‑turn LLM reasoning.

Authors:Xinyan Wang, Xiaogeng Liu, Chaowei Xiao
Title: ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention
Abstract:
Large Reasoning Models (LRMs) often reach a correct solution before their long Chain‑of‑Thought trace ends, yet continue with redundant verification, repeated attempts, or unnecessary exploration that wastes computation and can even overturn the correct answer. We frame this behavior as a latent productive‑to‑redundant transition and show that it is directly reflected in hidden states: around first‑correct‑solution (FCS) boundaries, late‑layer representations separate efficient from overthinking tokens, while boundary‑permutation and position‑control baselines collapse. Based on this signal, we propose ROM, a model‑agnostic streaming intervention framework that monitors frozen LRMs with a lightweight hidden‑state detector and intervenes at well‑formed reasoning boundaries. Counterfactual Self‑Correction (CSC) augments supervision with balanced wrong to correct trajectories, preserving useful pre‑FCS correction while labeling only post‑FCS continuation as redundant. Across MATH500, GSM8K, AIME25, and MMLU‑Pro, ROM improves the overall tradeoff on both Qwen3‑8B and DeepSeek‑R1‑Distill‑Qwen‑32B (DS‑32B): on Qwen3‑8B, it raises accuracy from 74.47% to 74.78% and reduces response length from 4262 to 3107 tokens; on DS‑32B, it raises accuracy from 68.60% to 68.72% and reduces response length from 3062 to 2319 tokens. The same FCS‑derived supervision transfers across scale and training origin, suggesting a shared long‑CoT boundary rather than a backbone‑specific artifact. ROM is compatible with L1, removing another 20.9‑21.6% tokens at zero accuracy loss. ROM also generalizes to open‑ended MMLU‑Pro (+1.56 pp, 35.4% shorter) and reduces wall‑clock latency by 46.5%. Code is available at https://github.com/SaFo‑Lab/ROM.

Authors:Donald Shenaj, Federico Errica, Antonio Carta
Title: Not All Layers Are Created Equal: Adaptive LoRA Ranks for Personalized Image Generation
Abstract:
Low Rank Adaptation (LoRA) is the de facto fine‑tuning strategy to generate personalized images from pre‑trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community's consensus, regardless of the personalized subject's complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine‑tuning on a subject. We achieve it by imposing an ordering of importance on the rank's positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA^2, achieves a competitive trade‑off between DINO, CLIP‑I, and CLIP‑T across 29 subjects while requiring much less memory and lower rank than high rank LoRA versions. Code: https://github.com/donaldssh/NotAllLayersAreCreatedEqual.

Authors:Nikolas Stavrou, Siamak Mehrkanoon
Title: SmaAT-QMix-UNet: A Parameter-Efficient Vector-Quantized UNet for Precipitation Nowcasting
Abstract:
Weather forecasting supports critical socioeconomic activities and complements environmental protection, yet operational Numerical Weather Prediction (NWP) systems remain computationally intensive, thus being inefficient for certain applications. Meanwhile, recent advances in deep data‑driven models have demonstrated promising results in nowcasting tasks. This paper presents SmaAT‑QMix‑UNet, an enhanced variant of SmaAT‑UNet that introduces two key innovations: a vector quantization (VQ) bottleneck at the encoder‑decoder bridge, and mixed kernel depth‑wise convolutions (MixConv) replacing selected encoder and decoder blocks. These enhancements both reduce the model's size and improve its nowcasting performance. We train and evaluate SmaAT‑QMix‑UNet on a Dutch radar precipitation dataset (2016‑2019), predicting precipitation 30 minutes ahead. Three configurations are benchmarked: using only VQ, only MixConv, and the full SmaAT‑QMix‑UNet. Grad‑CAM saliency maps highlight the regions influencing each nowcast, while a UMAP embedding of the codewords illustrates how the VQ layer clusters encoder outputs. The source code for SmaAT‑QMix‑UNet is publicly available on GitHub: https://github.com/nstavr04/MasterThesisSnellius.

Authors:Yuze Qin, Qingyong Li, Zhiqing Guo, Wen Wang, Yan Liu, Yangli-ao Geng
Title: Extending Precipitation Nowcasting Horizons via Spectral Fusion of Radar Observations and Foundation Model Priors
Abstract:
Precipitation nowcasting is critical for disaster mitigation and aviation safety. However, radar‑only models frequently suffer from a lack of large‑scale atmospheric context, leading to performance degradation at longer lead times. While integrating meteorological variables predicted by weather foundation models offers a potential remedy, existing architectures fail to reconcile the profound representational heterogeneities between radar imagery and meteorological data. To bridge this gap, we propose PW‑FouCast, a novel frequency‑domain fusion framework that leverages Pangu‑Weather forecasts as spectral priors within a Fourier‑based backbone. Our architecture introduces three key innovations: (i) Pangu‑Weather‑guided Frequency Modulation to align spectral magnitudes and phases with meteorological priors; (ii) Frequency Memory to correct phase discrepancies and preserve temporal evolution; and (iii) Inverted Frequency Attention to reconstruct high‑frequency details typically lost in spectral filtering. Extensive experiments on the SEVIR and MeteoNet benchmarks demonstrate that PW‑FouCast achieves state‑of‑the‑art performance, effectively extending the reliable forecast horizon while maintaining structural fidelity. Our code is available at https://github.com/Onemissed/PW‑FouCast.

Authors:Shiyan Hu, Jianxin Jin, Yang Shu, Peng Chen, Bin Yang, Chenjuan Guo
Title: Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction
Abstract:
Time series anomaly detection plays a critical role in many dynamic systems. Despite its importance, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross‑modal interaction effectively. To address the first challenge, we propose Fine‑grained Time‑text Semantic Alignment. It integrates exogenous and endogenous text information through cross‑view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross‑modal reconstruction to enable interaction. Extensive experiments on six real‑world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods. The code is available at: https://github.com/decisionintelligence/MindTS.

Authors:Hyoseok Park, Yeonsang Park
Title: PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection
Abstract:
Long‑context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step ‑‑ a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block‑selection step: a memory‑bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast‑and‑weight paradigm ‑‑ the query fans out to all candidates via passive splitting, signatures are quasi‑static (matching electro‑optic MRR programming), and only rank order matters (relaxing precision to 4‑6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner‑product Similarity with Microring weights), a thin‑film lithium niobate (TFLN) similarity engine. Hardware‑impaired needle‑in‑a‑haystack evaluation on Qwen2.5‑7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four‑order‑of‑magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).

Authors:Bayezid Baten, M. Ayyan Iqbal, Sebastian Ament, Julius Kusuma, Nishant Garg
Title: BOxCrete: A Bayesian Optimization Open-Source AI Model for Concrete Strength Forecasting and Mix Optimization
Abstract:
Modern concrete must simultaneously satisfy evolving demands for mechanical performance, workability, durability, and sustainability, making mix designs increasingly complex. Recent studies leveraging Artificial Intelligence (AI) and Machine Learning (ML) models show promise for predicting compressive strength and guiding mix optimization, but most existing efforts are based on proprietary industrial datasets and closed‑source implementations. Here we introduce BOxCrete, an open‑source probabilistic modeling and optimization framework trained on a new open‑access dataset of over 500 strength measurements (1‑15 ksi) from 123 mixtures ‑ 69 mortar and 54 concrete mixes tested at five curing ages (1, 3, 5, 14, and 28 days). BOxCrete leverages Gaussian Process (GP) regression to predict strength development, achieving average R^2 = 0.94 and RMSE = 0.69 ksi, quantify uncertainty, and carry out multi‑objective optimization of compressive strength and embodied carbon. The dataset and model establish a reproducible open‑source foundation for data‑driven development of AI‑based optimized mix designs.

Authors:Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey
Title: Mechanisms of Introspective Awareness
Abstract:
Recent work shows that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept, a phenomenon cited as evidence of "introspective awareness." But what mechanisms underlie this capability, and do they reflect genuine introspective circuitry or more shallow heuristics? We investigate these questions in open‑source models and establish three main findings. First, introspection is behaviorally robust: detection achieves moderate true positive rates with 0% false positives across diverse prompts. We also find this capability emerges specifically from post‑training rather than pretraining. Second, introspection is not reducible to a single linear confound: anomaly detection relies on distributed MLP computation across multiple directions, implemented by evidence carrier and gate features. Third, models possess greater introspective capability than is elicited by default: ablating refusal directions improves detection by 53pp and a trained steering vector by 75pp. Overall, our results suggest that introspective awareness is behaviorally robust, grounded in nontrivial internal anomaly detection, and likely could be substantially improved in future models. Code: https://github.com/safety‑research/introspection‑mechanisms.

Authors:Mohamed A Mabrok
Title: HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis
Abstract:
We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator's phase‑space decomposition yields three functionally distinct representations: position~q (feature content), momentum~p (spatial gradients that encode boundary and texture information), and energy H = \tfrac12|z|^2 (a parameter‑free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task‑specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase‑space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state‑of‑the‑art Dice scores on ISIC\,2018 (89.38%), ISIC\,2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state‑of‑the‑art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator's momentum consistently encodes an interior\,>\,boundary\,>\,exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at https://github.com/Minds‑R‑Lab/hamvision.

Authors:Pawel Batorski, Paul Swoboda
Title: PLR: Plackett-Luce for Reordering In-Context Learning Examples
Abstract:
In‑context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to the ordering of the examples. However, exhaustive search over the n! possible orderings is infeasible. Therefore more efficient ordering methods use model confidence measures (e.g., label‑probability entropy) over label sets or take a direct approach to finding the best ordering. We propose PLR, a probabilistic approach to in‑context example ordering that replaces discrete ordering search with learning a probability distribution over orderings with the Plackett‑Luce model. PLR models orderings using a Plackett‑Luce distribution and iteratively updates its parameters to concentrate probability mass on high‑performing orderings under a task‑level metric. Candidate orderings are sampled efficiently via a Gumbel perturb‑and‑sort procedure. Experiments on multiple classification benchmarks show that PLR consistently improves few‑shot accuracy for k \in \4, 8, 16, 32\ examples, and we further demonstrate gains on mathematical reasoning tasks where label‑based ordering methods are not applicable. Our code is available at https://github.com/Batorskq/PLR.

Authors:Jaber Jaber, Osama Jaber
Title: TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
Abstract:
Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post‑training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto‑detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single‑batch throughput by 6.6%. During autoregressive decoding, 98‑99% of tokens exit early while the model correctly solves a multi‑step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: https://github.com/RightNow‑AI/TIDE

Authors:Oussama Zekri, Théo Uscidda, Nicolas Boullé, Anna Korba
Title: Generalized Discrete Diffusion from Snapshots
Abstract:
We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large‑vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page : \hrefhttps://oussamazekri.fr/gddshttps://oussamazekri.fr/gdds.

Authors:Jaber Jaber, Osama Jaber
Title: AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
Abstract:
Writing high‑performance GPU kernels is among the most labor‑intensive tasks in machine learning systems engineering. We present AutoKernel, an open‑source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five‑stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge‑case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six‑tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max‑autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross‑entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel‑optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow‑AI/autokernel.

Authors:Hongyang Yang, Boyu Zhang, Yang She, Xinyu Liao, Xiaoli Zhang
Title: FinRL-X: An AI-Native Modular Infrastructure for Quantitative Trading
Abstract:
We present FinRL‑X, a modular and deployment‑consistent trading architecture that unifies data processing, strategy construction, backtesting, and broker execution under a weight‑centric interface. While existing open‑source platforms are often backtesting‑ or model‑centric, they rarely provide system‑level consistency between research evaluation and live deployment. FinRL‑X addresses this gap through a composable strategy pipeline that integrates stock selection, portfolio allocation, timing, and portfolio‑level risk overlays within a unified protocol. The framework supports both rule‑based and AI‑driven components, including reinforcement learning allocators and LLM‑based sentiment signals, without altering downstream execution semantics. FinRL‑X provides an extensible foundation for reproducible, end‑to‑end quantitative trading research and deployment. The official FinRL‑X implementation is available at https://github.com/AI4Finance‑Foundation/FinRL‑Trading.

Authors:Fabien Polly
Title: FluidWorld: Reaction-Diffusion Dynamics as a Predictive Substrate for World Models
Abstract:
World models learn to predict future states of an environment, enabling planning and mental simulation. Current approaches default to Transformer‑based predictors operating in learned latent spaces. This comes at a cost: O(N^2) computation and no explicit spatial inductive bias. This paper asks a foundational question: is self‑attention necessary for predictive world modeling, or can alternative computational substrates achieve comparable or superior results? I introduce FluidWorld, a proof‑of‑concept world model whose predictive dynamics are governed by partial differential equations (PDEs) of reaction‑diffusion type. Instead of using a separate neural network predictor, the PDE integration itself produces the future state prediction. In a strictly parameter‑matched three‑way ablation on unconditional UCF‑101 video prediction (64x64, ~800K parameters, identical encoder, decoder, losses, and data), FluidWorld is compared against both a Transformer baseline (self‑attention) and a ConvLSTM baseline (convolutional recurrence). While all three models converge to comparable single‑step prediction loss, FluidWorld achieves 2x lower reconstruction error, produces representations with 10‑15% higher spatial structure preservation and 18‑25% more effective dimensionality, and critically maintains coherent multi‑step rollouts where both baselines degrade rapidly. All experiments were conducted on a single consumer‑grade PC (Intel Core i5, NVIDIA RTX 4070 Ti), without any large‑scale compute. These results establish that PDE‑based dynamics, which natively provide O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion, are a viable and parameter‑efficient alternative to both attention and convolutional recurrence for world modeling.

Authors:Elif Ceren Gok Yildirim, Murat Onur Yildirim, Joaquin Vanschoren
Title: Pruned Adaptation Modules: A Simple yet Strong Baseline for Continual Foundation Models
Abstract:
The continual learning literature has rapidly shifted from traditional class incremental learning (CIL) techniques to foundation model (FM)‑based CIL methods without a clear understanding of how these newer approaches compare to strong, lightweight convolutional baselines. This abrupt transition has created a substantial methodological gap, making it difficult to assess whether recent FM‑based CIL progress reflects genuine advances or merely the absence of rigorous baselines. To address this gap, we introduce Pruned Adaptation Modules (PAM), a simple yet effective method that freezes the vast majority of the pre‑trained ResNet while enabling scalable continual adaptation through sparse task‑specific layers. PAM yields up to a ~5x reduction in trainable parameters and a ~6x reduction in total parameters, significantly reducing the cost of continual updates. Across diverse benchmarks, PAM consistently mitigates catastrophic forgetting and outperforms state‑of‑the‑art FM‑based CIL approaches. Our findings position PAM as a strong and transparent baseline that helps bridge the gap between traditional and FM‑based CIL, guiding future research for a more accurate assessment of true progress in continual adaptation. The code can be found at: https://github.com/ElifCerenGokYildirim/PAM.

Authors:Tianhao Ma, Ximing Li, Changchun Li, Renchu Guan
Title: Learning from Label Proportions with Dual-proportion Constraints
Abstract:
Learning from Label Proportions (LLP) is a weakly supervised problem in which the training data comprise bags, that is, groups of instances, each annotated only with bag‑level class label proportions, and the objective is to learn a classifier that predicts instance‑level labels. This setting is widely applicable when privacy constraints limit access to instance‑level annotations or when fine‑grained labeling is costly or impractical. In this work, we introduce a method that leverages Dual proportion Constraints (LLP‑DC) during training, enforcing them at both the bag and instance levels. Specifically, the bag‑level training aligns the mean prediction with the given proportion, and the instance‑level training aligns hard pseudo‑labels that satisfy the proportion constraint, where a minimum‑cost maximum‑flow algorithm is used to generate hard pseudo‑labels. Extensive experimental results across various benchmark datasets empirically validate that LLP‑DC consistently improves over previous LLP methods across datasets and bag sizes. The code is publicly available at https://github.com/TianhaoMa5/CV PR2026_Findings_LLP_DC.

Authors:Shih-Wen Liu, Yen-Chang Chen, Wei-Ta Chu, Fu-En Yang, Yu-Chiang Frank Wang
Title: Frequency Switching Mechanism for Parameter-E!cient Multi-Task Learning
Abstract:
Multi‑task learning (MTL) aims to enable a single model to solve multiple tasks efficiently; however, current parameter‑efficient fine‑tuning (PEFT) methods remain largely limited to single‑task adaptation. We introduce Free Sinewich, a parameter‑efficient multi‑task learning framework that enables near‑zero‑cost weight modulation via frequency switching (Free). Specifically, a Sine‑AWB (Sinewich) layer combines low‑rank factors and convolutional priors into a single kernel, which is then modulated elementwise by a sinusoidal transformation to produce task‑specialized weights. A lightweight Clock Net is introduced to produce bounded frequencies that stabilize this modulation during training. Theoretically, sine modulation enhances the rank of low‑rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state‑of‑the‑art performance‑efficiency trade‑offs (e.g., up to +5.39% improvement over single‑task fine‑tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency‑based parameter sharing. Project page: \hrefhttps://casperliuliuliu.github.io/projects/Free‑Sinewich/https://casperliuliuliu.github.io/projects/Free‑Sinewich.

Authors:Long Xu, Junping Guo, Jianbo Zhao, Jianbo Lu, Yuzhong Peng
Title: DMMRL: Disentangled Multi-Modal Representation Learning via Variational Autoencoders for Molecular Property Prediction
Abstract:
Molecular property prediction constitutes a cornerstone of drug discovery and materials science, necessitating models capable of disentangling complex structure‑property relationships across diverse molecular modalities. Existing approaches frequently exhibit entangled representations‑‑conflating structural, chemical, and functional factors‑‑thereby limiting interpretability and transferability. Furthermore, conventional methods inadequately exploit complementary information from graphs, sequences, and geometries, often relying on naive concatenation that neglects inter‑modal dependencies. In this work, we propose DMMRL, which employs variational autoencoders to disentangle molecular representations into shared (structure‑relevant) and private (modality‑specific) latent spaces, enhancing both interpretability and predictive performance. The proposed variational disentanglement mechanism effectively isolates the most informative features for property prediction, while orthogonality and alignment regularizations promote statistical independence and cross‑modal consistency. Additionally, a gated attention fusion module adaptively integrates shared representations, capturing complex inter‑modal relationships. Experimental validation across seven benchmark datasets demonstrates DMMRL's superior performance relative to state‑of‑the‑art approaches. The code and data underlying this article are freely available at https://github.com/xulong0826/DMMRL.

Authors:Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi
Title: Mixture of Chapters: Scaling Learnt Memory in Transformers
Abstract:
Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end‑to‑end, that transformer layers query via cross‑attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter‑based routing inspired by Mixture‑of‑Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso‑FLOP settings) on pre‑training and instruction fine‑tuning across relevant benchmarks. Our models surpass iso‑FLOP baselines suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine‑tuning).

Authors:Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He
Title: Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
Abstract:
Large language models (LLMs) used for multiple‑choice and pairwise evaluation tasks often exhibit selection bias due to non‑semantic factors like option positions and label symbols. Existing inference‑time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation‑Aware Group Relative Policy Optimization (PA‑GRPO), which mitigates selection bias by enforcing permutation‑consistent semantic reasoning. PA‑GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross‑permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency‑aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA‑GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU‑Text‑Computing/PA‑GRPO).

Authors:Florent Draye, Abir Harrasse, Vedant Palit, Tung-Yu Wu, Jiarui Liu, Punya Syon Pandey, Roderick Wu, Terry Jingchen Zhang, Zhijing Jin, Bernhard Schölkopf
Title: CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs
Abstract:
Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross‑Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer‑specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open‑source library for end‑to‑end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit‑Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT‑based mechanistic interpretability. Our code is available at: https://github.com/LLM‑Interp/CLT‑Forge.

Authors:Jason Dury
Title: Beyond Expression Similarity: Contrastive Learning Recovers Functional Gene Associations from Protein Interaction Structure
Abstract:
The Predictive Associative Memory (PAM) framework posits that useful relationships often connect items that co‑occur in shared contexts rather than items that appear similar in embedding space. A contrastive MLP trained on co‑occurrence annotations‑‑Contrastive Association Learning (CAL)‑‑has improved multi‑hop passage retrieval and discovered narrative function at corpus scale in text. We test whether this principle transfers to molecular biology, where protein‑protein interactions provide functional associations distinct from gene expression similarity. Four experiments across two biological domains map the operating envelope. On gene perturbation data (Replogle K562 CRISPRi, 2,285 genes), CAL trained on STRING protein interactions achieves cross‑boundary AUC of 0.908 where expression similarity scores 0.518. A second gene dataset (DepMap, 17,725 genes) confirms the result after negative sampling correction, reaching cross‑boundary AUC of 0.947. Two drug sensitivity experiments produce informative negatives that sharpen boundary conditions. Three cross‑domain findings emerge: (1) inductive transfer succeeds in biology‑‑a node‑disjoint split with unseen genes yields AUC 0.826 (Delta +0.127)‑‑where it fails in text (+/‑0.10), suggesting physically grounded associations are more transferable than contingent co‑occurrences; (2) CAL scores anti‑correlate with interaction degree (Spearman r = ‑0.590), with gains concentrating on understudied genes with focused interaction profiles; (3) tighter association quality outperforms larger but noisier training sets, reversing the text pattern. Results are stable across training seeds (SD < 0.001) and cross‑boundary threshold choices.

Authors:Yong Wang, Qifan Shen, Bao Zhang, Zijun Huang, Chengbo Zhu, Shuai Yao, Qisong Wu
Title: mmWave-Diffusion:A Novel Framework for Respiration Sensing Using Observation-Anchored Conditional Diffusion Model
Abstract:
Millimeter‑wave (mmWave) radar enables contactless respiratory sensing,yet fine‑grained monitoring is often degraded by nonstationary interference from body micromotions.To achieve micromotion interference removal,we propose mmWave‑Diffusion,an observation‑anchored conditional diffusion framework that directly models the residual between radar phase observations and the respiratory ground truth,and initializes sampling within an observation‑consistent neighborhood rather than from Gaussian noise‑thereby aligning the generative process with the measurement physics and reducing inference overhead. The accompanying Radar Diffusion Transformer (RDT) is explicitly conditioned on phase observations, enforces strict one‑to‑one temporal alignment via patch‑level dual positional encodings, and injects local physical priors through banded‑mask multi‑head cross‑attention, enabling robust denoising and interference removal in just 20 reverse steps. Evaluated on 13.25 hours of synchronized radar‑respiration data, mmWave‑Diffusion achieves state‑of‑the‑art waveform reconstruction and respiratory‑rate estimation with strong generalization. Code repository:https://github.com/goodluckyongw/mmWave‑Diffusion.

Authors:Hung Yun Tseng, Wuzhen Li, Blerina Gkotse, Grigorios Chrysos
Title: LJ-Bench: Ontology-Based Benchmark for U.S. Crime
Abstract:
The potential of Large Language Models (LLMs) to provide harmful information remains a significant concern due to the vast breadth of illegal queries they may encounter. Unfortunately, existing benchmarks only focus on a handful types of illegal activities, and are not grounded in legal works. In this work, we introduce an ontology of crime‑related concepts grounded in the legal frameworks of Model Panel Code, which serves as an influential reference for criminal law and has been adopted by many U.S. states, and instantiated using Californian Law. This structured knowledge forms the foundation for LJ‑Bench, the first comprehensive benchmark designed to evaluate LLM robustness against a wide range of illegal activities. Spanning 76 distinct crime types organized taxonomically, LJ‑Bench enables systematic assessment of diverse attacks, revealing valuable insights into LLM vulnerabilities across various crime categories: LLMs exhibit heightened susceptibility to attacks targeting societal harm rather than those directly impacting individuals. Our benchmark aims to facilitate the development of more robust and trustworthy LLMs. The LJ‑Bench benchmark and LJ‑Ontology, along with experiments implementation for reproducibility are publicly available at https://github.com/AndreaTseng/LJ‑Bench.

Authors:Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, Yaoqing Yang
Title: RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
Abstract:
Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape . The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, \textscMuon stands out by using Newton‑Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of \textscMuon still leaves room for further improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton‑Schulz iteration with a simple row‑wise \ell_2 normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. This substitution reduces the per‑iteration computational complexity from \mathcalO(mn\cdot\min(m,n)) to \mathcalO(mn) for an m× n weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non‑convex setting that match recent results for Muon optimizers, achieving the information‑theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall‑clock time. Our code is available at \hrefhttps://anonymous.4open.science/r/RMNP‑E8E1/this link.

Authors:Truong Quynh Hoa, Hoang Dinh Cuong, Truong Xuan Khanh
Title: Detecting Neurovascular Instability from Multimodal Physiological Signals Using Wearable-Compatible Edge AI: A Responsible Computational Framework
Abstract:
We propose Melaguard, a multimodal ML framework (Transformer‑lite, 1.2M parameters, 4‑head self‑attention) for detecting neurovascular instability (NVI) from wearable‑compatible physiological signals prior to structural stroke pathology. The model fuses heart rate variability (HRV), peripheral perfusion index, SpO2, and bilateral phase coherence into a composite NVI Score, designed for edge inference (WCET <=4 ms on Cortex‑M4). NVI ‑ the pre‑structural dysregulation of cerebrovascular autoregulation preceding overt stroke ‑ remains undetectable by existing single‑modality wearables. With 12.2 million incident strokes annually, continuous multimodal physiological monitoring offers a practical path to community‑scale screening. Three‑stage independent validation: (1) synthetic benchmark (n=10,000), AUC=0.88 [0.83‑0.92]; (2) clinical cohort PhysioNet CVES (n=172; 84 stroke, 88 control) ‑ Transformer‑lite achieves AUC=0.755 [0.630‑0.778], outperforming LSTM (0.643), Random Forest (0.665), SVM (0.472); HRV‑SDNN discriminates stroke (p=0.011); (3) PPG pipeline PhysioNet BIDMC (n=53) ‑‑ pulse rate r=0.748 and HRV surrogate r=0.690 vs. ECG ground truth. Cross‑modality validation on PPG‑BP (n=219) confirms PPG morphology classifies cerebrovascular disease at AUC=0.923 [0.869‑0.968]. Multimodal fusion consistently outperforms single‑modality baselines. Code: https://github.com/ClevixLab/Melaguard

Authors:Simon Ambrozak, Ulysse McConnell, Bhargav Srinivasan, Burak Ozkan, Can Firtina
Title: CERN: Correcting Errors in Raw Nanopore Signals Using Hidden Markov Models
Abstract:
Nanopore sequencing can read substantially longer sequences of nucleic acid molecules than other sequencing methods, which has led to advances in genomic analysis such as the gapless human genome assembly. By analyzing the raw electrical signal reads that nanopore sequencing generates from molecules, existing works can map these reads without translating them into DNA characters (i.e., basecalling), allowing for quick and efficient analysis of sequencing data. However, raw signals often contain errors due to noise and mistakes when processing them, which limits the overall accuracy of raw signal analysis. Our goal in this work is to detect and correct errors in raw signals to improve the accuracy of raw signal analyses. To this end, we propose CERN, a mechanism that trains and utilizes a Hidden Markov Model (HMM) to accurately correct signal errors. Our extensive evaluation on various datasets including E. coli, Fruit Fly, and Human genomes shows that CERN 1) consistently improves the overall mapping accuracy of the underlying raw signal analysis tools, 2) minimizes the burden on segmentation algorithm optimization with newer nanopore chemistries, and 3) functions without causing substantial computational overhead. We conclude that CERN provides an effective mechanism to systematically identify and correct the errors in raw nanopore signals before further analysis, which can enable the development of a new class of error correction mechanisms purely designed for raw nanopore signals. CERN is available at https://github.com/STORMgroup/CERN. We also provide the scripts to fully reproduce our results on our GitHub page.

Authors:Dhruv Menon, Vivek Singh, Xu Chen, Mohammad Reza Alizadeh Kiapi, Ivan Zyuzin, Hamish W. Macleod, Nakul Rampal, William Shepard, Omar M. Yaghi, David Fairen-Jimenez
Title: A chemical language model for reticular materials design
Abstract:
Reticular chemistry has enabled the synthesis of tens of thousands of metal‑organic frameworks (MOFs), yet the discovery of new materials still relies largely on intuition‑driven linker design and iterative experimentation. As a result, researchers explore only a small fraction of the vast chemical space accessible to reticular materials, limiting the systematic discovery of frameworks with targeted properties. Here, we introduce Nexerra‑R1, a building‑block chemical language model that enables inverse design in reticular chemistry through the targeted generation of organic linkers. Rather than generating complete frameworks directly, Nexerra‑R1 operates at the level of molecular building blocks, preserving the modular logic that underpins reticular synthesis. The model supports both unconstrained generation of low‑connectivity linkers and scaffold‑constrained design of symmetric multidentate motifs compatible with predefined nodes and topologies. We further combine linker generation with flow‑guided distributional targeting to steer the generative process toward application‑relevant objectives while maintaining chemical validity and assembly feasibility. The generated linkers are subsequently assembled into three‑dimensional frameworks and are structurally optimized to produce candidate materials compatible with experimental synthesis. Using Nexerra‑R1, we validate this strategy by rediscovering known MOFs and by proposing the experimental synthesis of a previously unreported framework, CU‑525, generated entirely in silico. Together, these results establish a general inverse‑design paradigm for reticular materials in which controllable chemical language modelling enables the direct translation from computational design to synthesizable frameworks.

Authors:Liu hung ming
Title: Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations
Abstract:
Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary‑free probe that converts V‑JEPA 2 continuous latent vectors into discrete symbol sequences without task‑specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V‑JEPA 2 pre‑trained representations ‑‑ not to the probe. We evaluate through category‑contrast experiments on Kinetics‑mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (chi^2 p < 10^‑4; MI 0.036‑‑0.117 bits, NMI 1.2‑‑3.9% of the 3‑bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V‑JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four‑stage roadmap toward an action‑conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.

Authors:Alex Popa, Adrian Taylor, Ranwa Al Mallah
Title: Learning Communication Between Heterogeneous Agents in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence
Abstract:
Reinforcement learning techniques are being explored as solutions to the threat of cyber attacks on enterprise networks. Recent research in the field of AI in cyber security has investigated the ability of homogeneous multi‑agent reinforcement learning agents, capable of inter‑agent communication, to respond to cyberattacks. This paper advances the study of learned communication in multi‑agent systems by examining heterogeneous agent capabilities within a simulated network environment. To this end, we leverage CommFormer, a publicly available state‑of‑the‑art communication algorithm, to train and evaluate agents within the Cyber Operations Research Gym (CybORG). Our results show that CommFormer agents with heterogeneous capabilities can outperform other algorithms deployed in the CybORG environment, by converging to an optimal policy up to four times faster while improving standard error by up 38%. The agents implemented in this project provide an additional avenue for exploration in the field of AI for cyber security, enabling further research involving realistic networks.

Authors:Behnood Rasti, Bikram Koirala, Paul Scheunders
Title: MiSiSUn: Minimum Simplex Semisupervised Unmixing
Abstract:
This paper proposes a semisupervised geometric unmixing approach called minimum simplex semisupervised unmixing (MiSiSUn). The geometry of the data was incorporated for the first time into library‑based unmixing using a simplex‑volume‑flavored penalty based on an archetypal analysis‑type linear model. The experimental results were performed on two simulated datasets considering different levels of mixing ratios and spatial instruction at varying input noise. MiSiSUn considerably outperforms state‑of‑the‑art semisupervised unmixing methods. The improvements vary from 1 dB to over 3 dB in different scenarios. The proposed method was also applied to a real dataset where visual interpretation is close to the geological map. MiSiSUn was implemented using PyTorch, which is open‑source and available at https://github.com/BehnoodRasti/MiSiSUn. Moreover, we provide a dedicated Python package for Semisupervised Unmixing, which is open‑source and includes all the methods used in the experiments for the sake of reproducibility.

Authors:Yadi Cao, Sicheng Lai, Jiahe Huang, Yang Zhang, Zach Lawrence, Rohan Bhakta, Izzy F. Thomas, Mingyun Cao, Chung-Hao Tsai, Zihao Zhou, Yidong Zhao, Hao Liu, Alessandro Marinoni, Alexey Arefiev, Rose Yu
Title: SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs
Abstract:
Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool‑use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost‑sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning cost‑sensitive parameters against traditional scanning approach in both accuracy and computational cost, spanning 2,916 single‑round (initial guess) and 1,900 multi‑round (adjustment by trial‑and‑error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform‑independent. Frontier LLMs achieve 46‑‑64% success rates in single‑round mode, dropping to 35‑‑54% under high accuracy requirements, rendering their initial guesses unreliable especially for high accuracy tasks. Multi‑round mode improves rates to 71‑‑80%, but LLMs are 1.5‑‑2.5x slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in‑context examples and reasoning effort, providing practical implications for deployment and fine‑tuning. We open‑source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost‑aware agentic designs for physics simulations, and for expanding new simulation environments. Code and data are available at https://github.com/Rose‑STL‑Lab/SimulCost‑Bench.

Authors:Jiaqi Yuan, Jialu Wang, Zihan Wang, Qingyun Sun, Ruijie Wang, Jianxin Li
Title: AgenticGEO: A Self-Evolving Agentic System for Generative Engine Optimization
Abstract:
Generative search engines represent a transition from traditional ranking‑based retrieval to Large Language Model (LLM)‑based synthesis, transforming optimization goals from ranking prominence towards content inclusion. Generative Engine Optimization (GEO), specifically, aims to maximize visibility and attribution in black‑box summarized outputs by strategically manipulating source content. However, existing methods rely on static heuristics, single‑prompt optimization, or engine preference rule distillation that is prone to overfitting. They cannot flexibly adapt to diverse content or the changing behaviors of generative engines. Moreover, effectively optimizing these strategies requires an impractical amount of interaction feedback from the engines. To address these challenges, we propose AgenticGEO, a self‑evolving agentic framework formulating optimization as a content‑conditioned control problem, which enhances intrinsic content quality to robustly adapt to the unpredictable behaviors of black‑box engines. Unlike fixed‑strategy methods, AgenticGEO employs a MAP‑Elites archive to evolve diverse, compositional strategies. To mitigate interaction costs, we introduce a Co‑Evolving Critic, a lightweight surrogate that approximates engine feedback for content‑specific strategy selection and refinement, efficiently guiding both evolutionary search and inference‑time planning. Through extensive in‑domain and cross‑domain experiments on two representative engines, AgenticGEO achieves state‑of‑the‑art performance and demonstrates robust transferability, outperforming 14 baselines across 3 datasets. Our code and model are available at: https://github.com/AIcling/agentic_geo.

Authors:Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Jing-Hao Xue, Hao Li, Salman Khan, Zhiqiang Shen
Title: From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
Abstract:
Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel‑grounded, meaning and language‑aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and their semantic class of tampered object, linking low‑level changes to high‑level understanding. Second, we release a new benchmark with per‑pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel‑level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics‑aware classification and natural language descriptions for the predicted regions. We also re‑evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors and reveal substantial over‑ and under‑scoring using mask‑only metrics, and expose failure modes on micro‑edits and off‑mask changes. Our framework advances the field from masks to pixels, meanings and language descriptions, establishing a rigorous standard for tamper localization, semantic classification and description. Code and benchmark data are available at https://github.com/VILA‑Lab/PIXAR.

Authors:Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz, Santiago Zazo, Juan Parras
Title: Kolmogorov-Arnold causal generative models
Abstract:
Causal generative models provide a principled framework for answering observational, interventional, and counterfactual queries from observational data. However, many deep causal models rely on highly expressive architectures with opaque mechanisms, limiting auditability in high‑stakes domains. We propose KaCGM, a causal generative model for mixed‑type tabular data where each structural equation is parameterized by a Kolmogorov‑‑Arnold Network (KAN). This decomposition enables direct inspection of learned causal mechanisms, including symbolic approximations and visualization of parent‑‑child relationships, while preserving query‑agnostic generative semantics. We introduce a validation pipeline based on distributional matching and independence diagnostics of inferred exogenous variables, allowing assessment using observational data alone. Experiments on synthetic and semi‑synthetic benchmarks show competitive performance against state‑of‑the‑art methods. A real‑world cardiovascular case study further demonstrates the extraction of simplified structural equations and interpretable causal effects. These results suggest that expressive causal generative modeling and functional transparency can be achieved jointly, supporting trustworthy deployment in tabular decision‑making settings. Code: https://github.com/aalmodovares/kacgm

Authors:Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar
Title: The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus
Abstract:
LLMs are increasingly used as general‑purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open‑ended read‑eval‑print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce λ‑RLM, a framework for long‑context reasoning that replaces free‑form recursive code generation with a typed functional runtime grounded in λ‑calculus. It executes a compact library of pre‑verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that λ‑RLM admits formal guarantees absent from standard RLMs, including termination, closed‑form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long‑context reasoning tasks and nine base models, λ‑RLM outperforms standard RLM in 29 of 36 model‑task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long‑context reasoning than open‑ended recursive code generation. The complete implementation of λ‑RLM, is open‑sourced for the community at: https://github.com/lambda‑calculus‑LLM/lambda‑RLM.

Authors:Henry J. Kobs
Title: Continual Learning as Shared-Manifold Continuation Under Compatible Shift
Abstract:
Continual learning methods usually preserve old behavior by regularizing parameters, matching old outputs, or replaying previous examples. These strategies can reduce forgetting, but they do not directly specify how the latent representation should evolve. We study a narrower geometric alternative for the regime where old and new data should remain on the same latent support: continual learning as continuation of a shared manifold. We instantiate this view within Support‑Preserving Manifold Assimilation (SPMA) and evaluate a geometry‑preserving variant, SPMA‑OG, that combines sparse replay, output distillation, relational geometry preservation, local smoothing, and chart‑assignment regularization on old anchors. On representative compatible‑shift CIFAR10 and Tiny‑ImageNet runs, SPMA‑OG improves over sparse replay baselines in old‑task retention and representation‑preservation metrics while remaining competitive on new‑task accuracy. On a controlled synthetic atlas‑manifold benchmark, it achieves near‑perfect anchor‑geometry preservation while also improving new‑task accuracy over replay. These results provide evidence that geometry‑aware anchor regularization is a useful inductive bias when continual learning should preserve a shared latent support rather than create a new one.

Authors:Leonardo Kuffo, Sven Hepkema, Peter Boncz
Title: A Super Fast K-means for Indexing Vector Embeddings
Abstract:
We present SuperKMeans: a k‑means variant designed for clustering collections of high‑dimensional vector embeddings. SuperKMeans' clustering is up to 7x faster than FAISS and Scikit‑Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks. SuperKMeans acceleration comes from reducing data‑access and compute overhead by reliably and efficiently pruning dimensions that are not needed to assign a vector to a centroid. Furthermore, we present Early Termination by Recall, a novel mechanism that early‑terminates k‑means when the quality of the centroids for retrieval tasks stops improving across iterations. In practice, this further reduces runtimes without compromising retrieval quality. We open‑source our implementation at https://github.com/cwida/SuperKMeans

Authors:Hao Wang, Licheng Pan, Qingsong Wen, Jialin Yu, Zhichao Chen, Chunyuan Zheng, Xiaoxi Li, Zhixuan Chu, Chao Xu, Mingming Gong, Haoxuan Li, Yuan Lu, Zhouchen Lin, Philip Torr, Yan Liu
Title: Deep Autocorrelation Modeling for Time-Series Forecasting: Progress and Prospects
Abstract:
Autocorrelation is a defining characteristic of time‑series data, where each observation is statistically dependent on its predecessors. In the context of deep time‑series forecasting, autocorrelation arises in both the input history and the label sequences, presenting two central research challenges: (1) designing neural architectures that model autocorrelation in history sequences, and (2) devising learning objectives that model autocorrelation in label sequences. Recent studies have made strides in tackling these challenges, but a systematic survey examining both aspects remains lacking. To bridge this gap, this paper provides a comprehensive review of deep time‑series forecasting from the perspective of autocorrelation modeling. In contrast to existing surveys, this work makes two distinctive contributions. First, it proposes a novel taxonomy that encompasses recent literature on both model architectures and learning objectives ‑‑ whereas prior surveys neglect or inadequately discuss the latter aspect. Second, it offers a thorough analysis of the motivations, insights, and progression of the surveyed literature from a unified, autocorrelation‑centric perspective, providing a holistic overview of the evolution of deep time‑series forecasting. The full list of papers and resources is available at https://github.com/Master‑PLC/Awesome‑TSF‑Papers.

Authors:Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan
Title: What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
Abstract:
Test‑Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo‑rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo‑labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective‑Complementary Reinforcement Learning), a robust test‑time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo‑Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy‑Gated Negative Pseudo‑Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper‑Yan/SCRL.

Authors:Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov
Title: IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
Abstract:
Vision‑Language Models like CLIP are extensively used for inter‑modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra‑modal tasks like image‑to‑image retrieval, their performance suffers from the intra‑modal misalignment. In this paper we study intra‑modal misalignment in CLIP with a focus on the role of the projectors that map pre‑projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter‑modal operator responsible for aligning the two modalities during training, and a second, intra‑modal operator that only enforces intra‑modal normalization but does nothing to promote intra‑modal alignment. Via spectral analysis of the inter‑modal operator, we identify an approximately isotropic subspace in which the two modalities are well‑aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra‑modal alignment. Our experiments on intra‑modal retrieval and classification benchmarks show that our training‑free method reduces intra‑modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre‑trained CLIP‑like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.

Authors:Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang
Title: The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
Abstract:
The key‑value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit‑identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross‑task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information‑carrying state. Removing the cache entirely and recomputing from scratch yields token‑identical output under greedy decoding on all models tested. We build on this result with KV‑Direct, a bounded‑memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3‑4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV‑Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window‑only), KV‑Direct maintains 100% token match at every cache budget; all baselines degrade to 5‑28%. A per‑operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV‑Direct.

Authors:Minghe Xu, Rouying Wu, ChiaWei Chu, Xiao Wang, Yu Li
Title: PFM-VEPAR: Prompting Foundation Models for RGB-Event Camera based Pedestrian Attribute Recognition
Abstract:
Event‑based pedestrian attribute recognition (PAR) leverages motion cues to enhance RGB cameras in low‑light and motion‑blur scenarios, enabling more accurate inference of attributes like age and emotion. However, existing two‑stream multimodal fusion methods introduce significant computational overhead and neglect the valuable guidance from contextual samples. To address these limitations, this paper proposes an Event Prompter. Discarding the computationally expensive auxiliary backbone, this module directly applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations to the event data. This design extracts frequency‑domain event features at a minimal computational cost, thereby effectively augmenting the RGB branch. Furthermore, an external memory bank designed to provide rich prior knowledge, combined with modern Hopfield networks, enables associative memory‑augmented representation learning. This mechanism effectively mines and leverages global relational knowledge across different samples. Finally, a cross‑attention mechanism fuses the RGB and event modalities, followed by feed‑forward networks for attribute prediction. Extensive experiments on multiple benchmark datasets fully validate the effectiveness of the proposed RGB‑Event PAR framework. The source code of this paper will be released on https://github.com/Event‑AHU/OpenPAR

Authors:J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang
Title: EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
Abstract:
Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high‑stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high‑stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama‑3.2‑3B while grounding (G_\max@3) rises from 47.6 to 78.2; hallucinations drop nearly 5× and evidence‑supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama‑3.1‑8B, demonstrating consistent behavioral change across domains. Our code is open‑sourced at https://github.com/Wizaaard/EvidenceRL.git.

Authors:Jinming Wang, Hai Wang, Hongkai Wen, Geyong Min, Man Luo
Title: TRACE: Trajectory Recovery with State Propagation Diffusion for Urban Mobility
Abstract:
High‑quality GPS trajectories are essential for location‑based web services and smart city applications, including navigation, ride‑sharing and delivery. However, due to low sampling rates and limited infrastructure coverage during data collection, real‑world trajectories are often sparse and feature unevenly distributed location points. Recovering these trajectories into dense and continuous forms is essential but challenging, given their complex and irregular spatio‑temporal patterns. In this paper, we introduce a novel diffusion model for trajectory recovery named TRACE, which reconstruct dense and continuous trajectories from sparse and incomplete inputs. At the core of TRACE, we propose a State Propagation Diffusion Model (SPDM), which integrates a novel memory mechanism, so that during the denoising process, TRACE can retain and leverage intermediate results from previous steps to effectively reconstruct those hard‑to‑recover trajectory segments. Extensive experiments on multiple real‑world datasets show that TRACE outperforms the state‑of‑the‑art, offering >26% accuracy improvement without significant inference overhead. Our work strengthens the foundation for mobile and web‑connected location services, advancing the quality and fairness of data‑driven urban applications. Code is available at: https://github.com/JinmingWang/TRACE

Authors:Nathan Weill, Kaizheng Wang
Title: Pseudo-Labeling for Unsupervised Domain Adaptation with Kernel GLMs
Abstract:
We propose a principled framework for unsupervised domain adaptation under covariate shift in kernel Generalized Linear Models (GLMs), encompassing kernelized linear, logistic, and Poisson regression with ridge regularization. Our goal is to minimize prediction error in the target domain by leveraging labeled source data and unlabeled target data, despite differences in covariate distributions. We partition the labeled source data into two batches: one for training a family of candidate models, and the other for building an imputation model. This imputation model generates pseudo‑labels for the target data, enabling robust model selection. We establish non‑asymptotic excess‑risk bounds that characterize adaptation performance through an "effective labeled sample size", explicitly accounting for the unknown covariate shift. Experiments on synthetic and real datasets demonstrate consistent performance gains over source‑only baselines.

Authors:Tomasz Wietrzykowski
Title: Anatomical Heterogeneity in Transformer Language Models
Abstract:
Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2‑135M, a 30‑layer, 135M‑parameter causal language model, using five diagnostic metrics: weight predictability (R2), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R2 = 0.91) with a universal oscillatory delta pattern (correlation ~= ‑0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8‑11, up to +63,419% PPL degradation) to anti‑layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (alpha = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof‑of‑concept experiment confirms this: 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.

Authors:Celal Alagöz, Mehmet Kurnaz, Farhan Aadil
Title: MSNet and LS-Net: Scalable Multi-Scale Multi-Representation Networks for Time Series Classification
Abstract:
Time series classification (TSC) performance depends not only on architectural design but also on the diversity of input representations. In this work, we propose a scalable multi‑scale convolutional framework that systematically integrates structured multi‑representation inputs for univariate time series. We introduce two architectures: MSNet, a hierarchical multi‑scale convolutional network optimized for robustness and calibration, and LS‑Net, a lightweight variant designed for efficiency‑aware deployment. In addition, we adapt LiteMV ‑‑ originally developed for multivariate inputs ‑‑ to operate on multi‑representation univariate signals, enabling cross‑representation interaction. We evaluate all models across 142 benchmark datasets under a unified experimental protocol. Critical Difference analysis confirms statistically significant performance differences among the top models. Results show that LiteMV achieves the highest mean accuracy, MSNet provides superior probabilistic calibration (lowest NLL), and LS‑Net offers the best efficiency‑accuracy tradeoff. Pareto analysis further demonstrates that multi‑representation multi‑scale modeling yields a flexible design space that can be tuned for accuracy‑oriented, calibration‑oriented, or resource‑constrained settings. These findings establish scalable multi‑representation multi‑scale learning as a principled and practical direction for modern TSC. Reference implementation of MSNet and LS‑Net is available at: https://github.com/alagoz/msnet‑lsnet‑tsc

Authors:Wentao Wang, Haoran Xu, Guang Tan
Title: GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space
Abstract:
In autonomous driving, multi‑agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling \em heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose \em GT‑Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT‑Space constructs a common feature space from ground‑truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real‑world dataset (RCooper) demonstrate that GT‑Space consistently outperforms baselines in detection accuracy while delivering robust performance. Our code will be released at https://github.com/KingScar/GT‑Space.

Authors:Vivan Madan, Prajwal Singhania, Abhinav Bhatele, Tom Goldstein, Ashwinee Panda
Title: Speculating Experts Accelerates Inference for Mixture-of-Experts
Abstract:
Mixture‑of‑Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per‑token compute. However, in memory‑constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU‑GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute‑memory overlap by eliminating the need to re‑fetch true router‑selected experts. Integrated into an optimized inference engine, our approach achieves up to 14% reduction in time per output token (TPOT) over on‑demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, we further examine lightweight estimators that improve expert prediction hit rates, thereby reducing performance degradation. Our code is released in open‑source at https://github.com/axonn‑ai/yalis/tree/offload_prefetch.

Authors:Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen, Huan Liu
Title: Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
Abstract:
Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. \underlineFirst, to address pattern memorization, Explanatory Inversion (EI) generates targeted ``explanatory probes'' that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. \underlineSecond, to improve generalization, Explanatory GRPO (\textttEXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma‑7b as the student model, our method yields an average 20.39% increase over zero‑shot performance and a 6.02% improvement over the state‑of‑the‑art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine‑tuning with 10‑25% training data) and strong generalization to out‑of‑distribution tasks. Implementation is released at https://github.com/Zhen‑Tan‑dmml/ExGRPO.git.

Authors:Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu
Title: DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding
Abstract:
With the growing adoption of vision‑language‑action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter‑view inconsistency when applied to high‑resolution multi‑view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi‑view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross‑attention. For decoding, we employ a multi‑view transformer to reconstruct multi‑view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi‑view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

Authors:Shang-Jui Ray Kuo, Paola Cascante-Bonilla
Title: Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Abstract:
Large vision‑‑language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer‑based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet‑1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT‑family backbones with detection or segmentation training and find that dense‑task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer‑based vision encoders in VLMs.

Authors:Masoumeh Shafieinejad, Xi He, Mahshid Alinoori, John Jewell, Sana Ayromlou, Wei Pang, Veronica Chatrath, Garui Sharma, Deval Pandya
Title: MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data
Abstract:
Synthetic data is often perceived as a silver‑bullet solution to data anonymization and privacy‑preserving data publishing. Drawn from generative models like diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent developments of diffusion models have been effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and multi‑relational tables with interconnected constraints. MIDST inspired the development of novel black‑box and white‑box MIAs tailored to these target diffusion models as a key outcome, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST

Authors:Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan
Title: Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Abstract:
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state‑of‑the‑art VLM‑based grounding approaches struggle with complex metric‑semantic language queries. To address this limitation, we propose MAPG (Multi‑Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM‑EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG‑Bench, specifically designed to evaluate metric‑semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real‑world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.

Authors:Tianci Luo, Jinpeng Wang, Shiyu Qin, Niu Lian, Yan Feng, Bin Chen, Chun Yuan, Shu-Tao Xia
Title: PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment
Abstract:
Visual In‑Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch‑wise fusion framework and model‑agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi‑prompting through locality‑aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out‑of‑distribution settings, and various retrieval scenarios. This work establishes a reliable locality‑aware paradigm for prompt fusion, moving beyond prior patch‑wise approaches. Code is available at https://github.com/luotc‑why/ICLR26‑PromptHub.

Authors:Yizhou Han, Di Wu, Blesson Varghese
Title: DriftGuard: Mitigating Asynchronous Data Drift in Federated Learning
Abstract:
In real‑world Federated Learning (FL) deployments, data distributions on devices that participate in training evolve over time. This leads to asynchronous data drift, where different devices shift at different times and toward different distributions. Mitigating such drift is challenging: frequent retraining incurs high computational cost on resource‑constrained devices, while infrequent retraining degrades performance on drifting devices. We propose DriftGuard, a federated continual learning framework that efficiently adapts to asynchronous data drift. DriftGuard adopts a Mixture‑of‑Experts (MoE) inspired architecture that separates shared parameters, which capture globally transferable knowledge, from local parameters that adapt to group‑specific distributions. This design enables two complementary retraining strategies: (i) global retraining, which updates the shared parameters when system‑wide drift is identified, and (ii) group retraining, which selectively updates local parameters for clusters of devices identified via MoE gating patterns, without sharing raw data. Experiments across multiple datasets and models show that DriftGuard matches or exceeds state‑of‑the‑art accuracy while reducing total retraining cost by up to 83%. As a result, it achieves the highest accuracy per unit retraining cost, improving over the strongest baseline by up to 2.3x. DriftGuard is available for download from https://github.com/blessonvar/DriftGuard.

Authors:Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng
Title: RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
Abstract:
Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine‑grained, state‑level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state‑level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state‑wise contributions to success, followed by topology‑aware graph propagation to quantify contributions and yield objective, state‑level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr‑group/RewardFlow.

Authors:Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang
Title: Memento-Skills: Let Agents Design Agents
Abstract:
We introduce \emphMemento‑Skills, a generalist, continually‑learnable LLM agent system that functions as an \emphagent‑designing agent: it autonomously constructs, adapts, and improves task‑specific agents through experience. The system is built on a memory‑based reinforcement learning framework with \emphstateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the \emphRead‑‑Write Reflective Learning mechanism introduced in \emphMemento~2~\citewang2025memento2. In the \emphread phase, a behaviour‑trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the \emphwrite phase, the agent updates and expands its skill library based on new experience. This closed‑loop design enables \emphcontinual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human‑designed agents, Memento‑Skills enables a generalist agent to \emphdesign agents end‑to‑end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the \emphGeneral AI Assistants benchmark and \emphHumanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento‑Teams/Memento‑Skills.

Authors:Yinan Xia, Haotian Zhang, Huiming Wang
Title: Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning
Abstract:
Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty‑Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty‑level average as a well‑founded reference for length optimization. Extensive experiments on both in‑domain and out‑of‑domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade‑off between accuracy and length. The code is available at https://github.com/Yinan‑Xia/DDPO.

Authors:Seonghyun Jin, Jong Chul Ye
Title: FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction
Abstract:
Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant‑memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training‑free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per‑token variance and computes a Kalman‑style gain that adaptively balances memory retention against new observations. Process noise ‑‑ governing how much the latent state is expected to change between frames ‑‑ is estimated online from EMA‑normalized temporal drift of candidate tokens. Using extensive experiments, we demonstrate that FILT3R yields an interpretable, plug‑in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long‑horizon stability for depth, pose, and 3D reconstruction, compared to the existing methods. Code will be released at https://github.com/jinotter3/FILT3R.

Authors:Chengxuan Lu, Shukuan Wang, Yanjie Li, Wei Liu, Shiji Jin, Fuyuan Qian, Peiming Li, Baigui Sun, Yang Liu
Title: AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models
Abstract:
Reinforcement learning (RL) for large‑scale Vision‑Language‑Action (VLA) models faces significant challenges in computational efficiency and data acquisition. We propose AcceRL, a fully asynchronous and decoupled RL framework designed to eliminate synchronization barriers by physically isolating training, inference, and rollouts. Crucially, AcceRL is the first to integrate a plug‑and‑play, trainable world model into a distributed asynchronous RL pipeline to generate virtual experiences. Experiments on the LIBERO~\citeliu2023libero benchmark demonstrate that AcceRL achieves state‑of‑the‑art (SOTA) performance. Systematically, it exhibits super‑linear scaling in throughput and highly efficient hardware utilization. Algorithmically, the world‑model‑augmented variant delivers unprecedented sample efficiency and robust training stability in complex control tasks. Code is publicly available at https://github.com/distanceLu/AcceRL.

Authors:Jason Dury
Title: From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory
Abstract:
Embedding models group text by semantic content, what text is about. We show that temporal co‑occurrence within texts discovers a different kind of structure: recurrent transition‑structure concepts or what text does. We train a 29.4M‑parameter contrastive model on 373 million co‑occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre‑trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co‑occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi‑resolution concept map; from broad modes like "direct confrontation" and "lyrical meditation" to precise registers and scene templates like "sailor dialect" and "courtroom cross‑examination." At k=100, clusters average 4,508 books each (of 9,766), confirming corpus‑wide patterns. Direct comparison with embedding‑similarity clustering shows that raw embeddings group by topic while association‑space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book‑concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi‑epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, the same framework producing qualitatively different behaviour in a different regime.

Authors:Kaiyang Li, Shihao Ji, Zhipeng Cai, Wei Li
Title: Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning
Abstract:
Approximate subgraph matching (ASM) is a task that determines the approximate presence of a given query graph in a large target graph. Being an NP‑hard problem, ASM is critical in graph analysis with a myriad of applications ranging from database systems and network science to biochemistry and privacy. Existing techniques often employ heuristic search strategies, which cannot fully utilize the graph information, leading to sub‑optimal solutions. This paper proposes a Reinforcement Learning based Approximate Subgraph Matching (RL‑ASM) algorithm that exploits graph transformers to effectively extract graph representations and RL‑based policies for ASM. Our model is built upon the branch‑and‑bound algorithm that selects one pair of nodes from the two input graphs at a time for potential matches. Instead of using heuristics, we exploit a Graph Transformer architecture to extract feature representations that encode the full graph information. To enhance the training of the RL policy, we use supervised signals to guide our agent in an imitation learning stage. Subsequently, the policy is fine‑tuned with the Proximal Policy Optimization (PPO) that optimizes the accumulative long‑term rewards over episodes. Extensive experiments on both synthetic and real‑world datasets demonstrate that our RL‑ASM outperforms existing methods in terms of effectiveness and efficiency. Our source code is available at https://github.com/KaiyangLi1992/RL‑ASM.

Authors:Haocheng Luo, Zehang Deng, Thanh-Toan Do, Mehrtash Harandi, Dinh Phung, Trung Le
Title: Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
Abstract:
Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate‑wise dynamics in logit space. Our analysis reveals that negative‑gradient updates cause residuals to expand rapidly along high‑curvature directions, which underlies the squeezing effect, whereas Sharpness‑Aware Minimization (SAM) can suppress this behavior through its curvature‑regularization effect. Building on this insight, we investigate logits‑SAM, a computationally efficient variant that perturbs only the output layer with negligible overhead. Extensive experiments on Pythia‑2.8B, Mistral‑7B, and Gemma‑2B‑IT across multiple datasets and benchmarks demonstrate that logits‑SAM consistently improves the effectiveness of DPO and integrates seamlessly with other DPO variants. Code is available at https://github.com/RitianLuo/logits‑sam‑dpo.

Authors:Naoki Morihira, Amal Nahar, Kartik Bharadwaj, Yasuhiro Kato, Akinobu Hayashi, Tatsuya Harada
Title: R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation
Abstract:
A central challenge in image‑based Model‑Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction‑based methods often waste capacity on large task‑irrelevant regions. Decoder‑free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2‑Dreamer, a decoder‑free MBRL framework with a self‑supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy‑reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta‑World, R2‑Dreamer is competitive with strong baselines such as DreamerV3 and TD‑MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC‑Subtle with tiny task‑relevant objects. These results suggest that an effective internal regularizer can enable versatile, high‑performance decoder‑free MBRL. Code is available at https://github.com/NM512/r2dreamer.

Authors:Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed
Title: Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
Abstract:
Recent adapter‑based CLIP tuning (e.g., Tip‑Adapter) is a strong few‑shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni‑modal feature vectors, overlooking fine‑grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training‑only framework. Instead of altering the lightweight adapter, we construct a high‑capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi‑scale visual patches and text prompts into a unified graph, (ii) performs deep cross‑modal reasoning via a Modality‑aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high‑fidelity class features. Crucially, we employ a cache‑aware dual‑objective strategy to supervise this relational knowledge directly into the Tip‑Adapter's key‑value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip‑Adapter with zero extra latency or memory. Across standard 1‑16‑shot benchmarks, our method consistently establishes a new state‑of‑the‑art. Ablations confirm that the auxiliary graph supervision, text‑guided reasoning, and node filtering are the essential ingredients for robust few‑shot adaptation. Code is available at https://github.com/MR‑Sherif/TOGA.git.

Authors:Alexander D. Goldie, Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz, Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O'Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach, Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster
Title: Procedural Generation of Algorithm Discovery Tasks in Machine Learning
Abstract:
Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. They suffer from many issues, such as: poor evaluation methodologies; data contamination; and containing saturated or very similar problems. Here, we introduce DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We present DiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released open‑source at https://github.com/AlexGoldie/discogen.

Authors:Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Gaiyang Han, Wanqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Title: CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution
Abstract:
Label‑free reinforcement learning enables large language models to improve reasoning capabilities without ground‑truth supervision, typically by treating majority‑voted answers as pseudo‑labels. However, we identify a critical failure mode: as training maximizes self‑consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self‑consistent errors from pseudo‑labels. This co‑evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label‑free baselines by 4.7‑5.9% on mathematical reasoning benchmarks. Moreover, self‑verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co‑evolve.

Authors:Binqing Wu, Zongjiang Shang, Shiyu Liu, Jianlong Huang, Jiahui Xu, Ling Chen
Title: AirDDE: Multifactor Neural Delay Differential Equations for Air Quality Forecasting
Abstract:
Accurate air quality forecasting is essential for public health and environmental sustainability, but remains challenging due to the complex pollutant dynamics. Existing deep learning methods often model pollutant dynamics as an instantaneous process, overlooking the intrinsic delays in pollutant propagation. Thus, we propose AirDDE, the first neural delay differential equation framework in this task that integrates delay modeling into a continuous‑time pollutant evolution under physical guidance. Specifically, two novel components are introduced: (1) a memory‑augmented attention module that retrieves globally and locally historical features, which can adaptively capture delay effects modulated by multifactor data; and (2) a physics‑guided delay evolving function, grounded in the diffusion‑advection equation, that models diffusion, delayed advection, and source/sink terms, which can capture delay‑aware pollutant accumulation patterns with physical plausibility. Extensive experiments on three real‑world datasets demonstrate that AirDDE achieves the state‑of‑the‑art forecasting performance with an average MAE reduction of 8.79% over the best baselines. The code is available at https://github.com/w2obin/airdde‑aaai.

Authors:Truong-Son Hy
Title: Binary Latent Protein Fitness Landscapes for Quantum Annealing Optimization
Abstract:
We propose Q‑BIOLAT, a framework for modeling and optimizing protein fitness landscapes in binary latent spaces. Starting from protein sequences, we leverage pretrained protein language models to obtain continuous embeddings, which are then transformed into compact binary latent representations. In this space, protein fitness is approximated using a quadratic unconstrained binary optimization (QUBO) model, enabling efficient combinatorial search via classical heuristics such as simulated annealing and genetic algorithms. On the ProteinGym benchmark, we demonstrate that Q‑BIOLAT captures meaningful structure in protein fitness landscapes and enables the identification of high‑fitness variants. Despite using a simple binarization scheme, our method consistently retrieves sequences whose nearest neighbors lie within the top fraction of the training fitness distribution, particularly under the strongest configurations. We further show that different optimization strategies exhibit distinct behaviors, with evolutionary search performing better in higher‑dimensional latent spaces and local search remaining competitive in preserving realistic sequences. Beyond its empirical performance, Q‑BIOLAT provides a natural bridge between protein representation learning and combinatorial optimization. By formulating protein fitness as a QUBO problem, our framework is directly compatible with emerging quantum annealing hardware, opening new directions for quantum‑assisted protein engineering. Our implementation is publicly available at: https://github.com/HySonLab/Q‑BIOLAT

Authors:Tynan Perez, Rafael Gomez-Bombarelli
Title: Self-Conditioned Denoising for Atomistic Representation Learning
Abstract:
The success of large‑scale pretraining in NLP and computer vision has catalyzed growing efforts to develop analogous foundation models for the physical sciences. However, pretraining strategies using atomistic data remain underexplored. To date, large‑scale supervised pretraining on DFT force‑energy labels has provided the strongest performance gains to downstream property prediction, out‑performing existing methods of self‑supervised learning (SSL) which remain limited to ground‑state geometries, and/or single domains of atomistic data. We address these shortcomings with Self‑Conditioned Denoising (SCD), a backbone‑agnostic reconstruction objective that utilizes self‑embeddings for conditional denoising across any domain of atomistic data, including small molecules, proteins, periodic materials, and 'non‑equilibrium' geometries. When controlled for backbone architecture and pretraining dataset, SCD significantly outperforms previous SSL methods on downstream benchmarks and matches or exceeds the performance of supervised force‑energy pretraining. We show that a small, fast GNN pretrained by SCD can achieve competitive or superior performance to larger models pretrained on significantly larger labeled or unlabeled datasets, across tasks in multiple domains. Our code is available at: https://github.com/TyJPerez/SelfConditionedDenoisingAtoms

Authors:Sophie Kearney, Shu Yang, Zixuan Wen, Weimin Lyu, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Chao Chen, Li Shen
Title: Tabular LLMs for Interpretable Few-Shot Alzheimer's Disease Prediction with Multimodal Biomedical Data
Abstract:
Accurate diagnosis of Alzheimer's disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few‑shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP‑GPT Tabular Alzheimer's Prediction GPT, a domain‑adapted tabular LLM framework built on TableGPT2 and fine‑tuned for few‑shot AD classification using tabular prompts rather than plain texts. We evaluate TAP‑GPT across four ADNI‑derived datasets, including QT‑PAD biomarkers and region‑level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP‑GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few‑shot setting while remaining competitive with state‑of‑the‑art general‑purpose LLMs. We show that feature selection mitigates degradation in high‑dimensional inputs and that TAP‑GPT maintains stable performance under simulated and real‑world missingness without imputation. Additionally, TAP‑GPT produces structured, modality‑aware reasoning aligned with established AD biology and shows greater stability under self‑reflection, supporting its use in iterative multi‑agent systems. To our knowledge, this is the first systematic application of a tabular‑specialized LLM to multimodal biomarker‑based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM‑driven multi‑agent clinical decision‑support systems. The source code is publicly available on GitHub: https://github.com/sophie‑kearney/TAP‑GPT.

Authors:Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Title: MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
Abstract:
Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta‑learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill‑driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient‑based updates via cloud LoRA fine‑tuning and Reinforcement Learning with a Process Reward Model (RL‑PRM). This is triggered during user‑inactive windows by the Opportunistic Meta‑Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher‑quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy‑based architecture, MetaClaw scales to production‑size LLMs without local GPUs. Experiments on MetaClaw‑Bench and AutoResearchClaw show that skill‑driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi‑K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming‑lab/MetaClaw.

Authors:Yasaswini Chebolu
Title: DesertFormer: Transformer-Based Semantic Segmentation for Off-Road Desert Terrain Classification in Autonomous Navigation Systems
Abstract:
Reliable terrain perception is a fundamental requirement for autonomous navigation in unstructured, off‑road environments. Desert landscapes present unique challenges due to low chromatic contrast between terrain categories, extreme lighting variability, and sparse vegetation that defy the assumptions of standard road‑scene segmentation models. We present DesertFormer, a semantic segmentation pipeline for off‑road desert terrain analysis based on SegFormer B2 with a hierarchical Mix Transformer (MiT‑B2) backbone. The system classifies terrain into ten ecologically meaningful categories ‑‑ Trees, Lush Bushes, Dry Grass, Dry Bushes, Ground Clutter, Flowers, Logs, Rocks, Landscape, and Sky ‑‑ enabling safety‑aware path planning for ground robots and autonomous vehicles. Trained on a purpose‑built dataset of 4,176 annotated off‑road images at 512x512 resolution, DesertFormer achieves a mean Intersection‑over‑Union (mIoU) of 64.4% and pixel accuracy of 86.1%, representing a +24.2% absolute improvement over a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). We further contribute a systematic failure analysis identifying the primary confusion patterns ‑‑ Ground Clutter to Landscape and Dry Grass to Landscape ‑‑ and propose class‑weighted training and copy‑paste augmentation for rare terrain categories. Code, checkpoints, and an interactive inference dashboard are released at https://github.com/Yasaswini‑ch/Vision‑based‑Desert‑Terrain‑Segmentation‑using‑SegFormer.

Authors:Martin G. Frasch
Title: Minimum-Action Learning: Energy-Constrained Symbolic Model Selection for Physical Law Identification from Noisy Data
Abstract:
Identifying physical laws from noisy observational data is a central challenge in scientific machine learning. We present Minimum‑Action Learning (MAL), a framework that selects symbolic force laws from a pre‑specified basis library by minimizing a Triple‑Action functional combining trajectory reconstruction, architectural sparsity, and energy‑conservation enforcement. A wide‑stencil acceleration‑matching technique reduces noise variance by 10,000x, transforming an intractable problem (SNR ~0.02) into a learnable one (SNR ~1.6); this preprocessing is the critical enabler shared by all methods tested, including SINDy variants. On two benchmarks ‑‑ Kepler gravity and Hooke's law ‑‑ MAL recovers the correct force law with Kepler exponent p = 3.01 +/‑ 0.01 at ~0.07 kWh (40% reduction vs. prediction‑error‑only baselines). The raw correct‑basis rate is 40% for Kepler and 90% for Hooke; an energy‑conservation‑based criterion discriminates the true force law in all cases, yielding 100% pipeline‑level identification. Basis library sensitivity experiments show that near‑confounders degrade selection (20% with added r^‑2.5 and r^‑1.5), while distant additions are harmless, and the conservation diagnostic remains informative even when the correct basis is absent. Direct comparison with noise‑robust SINDy variants, Hamiltonian Neural Networks, and Lagrangian Neural Networks confirms MAL's distinct niche: interpretable, energy‑constrained model selection that combines symbolic basis identification with dynamical rollout validation.

Authors:Vladimer Khasia
Title: HoloByte: Continuous Hyperspherical Distillation for Tokenizer-Free Modeling
Abstract:
Sequence modeling universally relies on discrete subword tokenization to circumvent the \mathcalO(N^2) computational intractability of native byte‑level attention. However, this heuristic quantization imposes artificial morphological boundaries, enforces vocabulary dependence, and fractures the continuity of the optimization landscape. To resolve this dichotomy, we introduce HoloByte: a strictly tokenizer‑free framework utilizing Continuous Hyperspherical Distillation. HoloByte partitions discrete byte sequences into fixed‑capacity chunks and projects them into a continuous, strictly bounded hyperspherical manifold via an invertible, dimension‑preserving orthogonal rotation operator. This spatial superposition allows a macroscopic transformer to operate exclusively on compressed continuous representations, formally reducing the exact attention time complexity from \mathcalO(N^2D) to \mathcalO\left( \fracN^2W^2D + ND^2 \right). A localized causal micro‑decoder subsequently unbinds these representations to compute exact byte‑level distributions. To govern this continuous trajectory, we propose a dual‑objective formulation incorporating a mathematically precise Holographic Latent Mean Squared Error, which strictly bounds the gradient and guarantees asymptotic stability. Theoretically, we derive the minimal embedding dimension D = Ω(W \ln |\mathcalV|) required to ensure error‑free discrete recovery from the continuous manifold. Empirically, under strictly matched parameter constraints, HoloByte is systematically outperforming a comparable discrete Byte‑Pair Encoding (BPE) baseline. These results establish Continuous Hyperspherical Distillation as a mathematically rigorous and computationally tractable foundation for vocabulary‑invariant sequence modeling. The code is available at https://github.com/VladimerKhasia/HoloByte

Authors:Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi
Title: Efficient Reasoning on the Edge
Abstract:
Large language models (LLMs) with chain‑of‑thought reasoning achieve state‑of‑the‑art performance across complex problem‑solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV‑cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on‑device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine‑tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory‑bound decoding, we exploit parallel test‑time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter‑switching mechanism that activates reasoning only when needed and a KV‑cache sharing strategy during prompt encoding, reducing time‑to‑first‑token for on‑device inference. Experiments on Qwen2.5‑7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.

Authors:Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo
Title: ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
Abstract:
Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data‑generation‑ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data‑generation‑ready digital object twins. Our pipeline transforms a single image into simulation‑ready and semantically annotated 3D asset, enabling large‑scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin‑100K, a dataset containing 100K high‑quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin‑100K offers high‑quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.

Authors:Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys, Elouan Vuichard, Jean-Michel Loubes
Title: Probing Cultural Signals in Large Language Models through Author Profiling
Abstract:
Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero‑shot setting, inferring singers' gender and ethnicity without task‑specific fine‑tuning. Across several open‑source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non‑trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek‑1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models' prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral‑8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma‑12B shows the most balanced behavior. Our code is available on [GitHub](https://github.com/ValentinLafargue/CulturalProbingLLM) and results on [HuggingFace](https://huggingface.co/datasets/ValentinLAFARGUE/AuthorProfilingResults).

Authors:Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li
Title: Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
Abstract:
Reasoning‑focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi‑hop QA benchmarks lack step‑level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open‑domain multi‑hop QA resource that provides decomposed sub‑questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine‑generated training examples (OmanicSynth) and 967 expert‑reviewed human‑annotated evaluation examples (OmanicBench). Systematic evaluations show that state‑of‑the‑art LLMs achieve only 73.11% multiple‑choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine‑tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning‑capability transfer. We release the data at https://huggingface.co/datasets/li‑lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

Authors:Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong, Li Tang, Wei Zhou, Renyang Liu
Title: REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models
Abstract:
Recent progress in image generation models (IGMs) enables high‑fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image‑side threats in black‑box settings, remains underexplored. To bridge this gap, we present REFORGE, a black‑box red‑teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke‑based images and optimizes perturbations with a cross‑attention‑guided masking strategy that allocates noise to concept‑relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness‑aware unlearning against multi‑modal adversarial attacks. Our code is at: https://github.com/Imfatnoily/REFORGE.

Authors:Yikai Gu, Lele Cao, Bo Zhao, Lei Lei, Lei You
Title: DISCOVER: A Solver for Distributional Counterfactual Explanations
Abstract:
Counterfactual explanations (CE) explain model decisions by identifying input modifications that lead to different predictions. Most existing methods operate at the instance level. Distributional Counterfactual Explanations (DCE) extend this setting by optimizing an optimal transport objective that balances proximity to a factual input distribution and alignment to a target output distribution, with statistical certification via chance constrained bounds. However, DCE relies on gradient based optimization, while many real‑world tabular pipelines are dominated by non‑differentiable models. We propose DISCOVER, a model‑agnostic solver for distributional counterfactual explanations. DISCOVER preserves the original DCE objective and certification while replacing gradient descent with a sparse propose‑and‑select search paradigm. It exploits a sample‑wise decomposition of the transport objective to compute per‑row impact scores and enforce a top‑k intervention budget, focusing edits on the most influential samples. To guide candidate generation without predictor gradients, DISCOVER introduces an OT‑guided cone sampling primitive driven by input‑side transport geometry. Experiments on multiple tabular datasets demonstrate strong joint alignment of input and output distributions, extending distributional counterfactual reasoning to modern black box learning pipelines. A code repository is available at https://github.com/understanding‑ml/DCE.

Authors:Hanif Rahman
Title: PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
Abstract:
We present PashtoCorp, a 1.25‑billion‑word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose‑built web scrapers, processed through a reproducible pipeline with Arabic‑script tokenization, SHA‑256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM‑R‑base on PashtoCorp reduces held‑out perplexity by 25.1% (8.08‑>6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%‑>21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma‑3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave‑one‑out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto‑corpus, https://huggingface.co/ihanif/xlmr‑pashto, and https://github.com/ihanif/pashto‑corpus.

Authors:Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan
Title: Decoding the Critique Mechanism in Large Reasoning Models
Abstract:
Large Reasoning Models (LRMs) exhibit backtracking and self‑verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating through the chain‑of‑thought (CoT), resulting in an incorrect intermediate conclusion, the model still reaches the correct final answer. This recovery implies that the model must possess an internal mechanism to detect errors and trigger self‑correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test‑time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self‑verification mechanism. Our code is available at https://github.com/mail‑research/lrm‑critique‑vectors.

Authors:Moonsoo Park, Seulbeen Je, Donghyeon Park
Title: ReFORM: Review-aggregated Profile Generation via LLM with Multi-Factor Attention for Restaurant Recommendation
Abstract:
In recommender systems, large language models (LLMs) have gained popularity for generating descriptive summarization to improve recommendation robustness, along with Graph Convolution Networks. However, existing LLM‑enhanced recommendation studies mainly rely on the internal knowledge of LLMs about item titles while neglecting the importance of various factors influencing users' decisions. Although information reflecting various decision factors of each user is abundant in reviews, few studies have actively exploited such insights for recommendation. To address these limitations, we propose a ReFORM: Review‑aggregated Profile Generation via LLM with Multi‑FactOr Attentive RecoMmendation framework. Specifically, we first generate factor‑specific user and item profiles from reviews using LLM to capture a user's preference by items and an item's evaluation by users. Then, we propose a Multi‑Factor Attention to highlight the most influential factors in each user's decision‑making process. In this paper, we conduct experiments on two restaurant datasets of varying scales, demonstrating its robustness and superior performance over state‑of‑the‑art baselines. Furthermore, in‑depth analyses validate the effectiveness of the proposed modules and provide insights into the sources of personalization. Our source code and datasets are available at https://github.com/m0onsoo/ReFORM.

Authors:Karen Sargsyan
Title: Functorial Neural Architectures from Higher Inductive Types
Abstract:
Neural networks systematically fail at compositional generalization ‑‑ producing correct outputs for novel combinations of known parts. We show that this failure is architectural: compositional generalization is equivalent to functoriality of the decoder, and this perspective yields both guarantees and impossibility results. We compile Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of a target space to a category of parametric maps: path constructors become generator networks, composition becomes structural concatenation, and 2‑cells witnessing group relations become learned natural transformations. We prove that decoders assembled by structural concatenation of independently generated segments are strict monoidal functors (compositional by construction), while softmax self‑attention is not functorial for any non‑trivial compositional task. Both results are formalized in Cubical Agda. Experiments on three spaces validate the full hierarchy: on the torus (\mathbbZ^2), functorial decoders outperform non‑functorial ones by 2‑2.7x; on S^1 \vee S^1 (F_2), the type‑A/B gap widens to 5.5‑10x; on the Klein bottle (\mathbbZ \rtimes \mathbbZ), a learned 2‑cell closes a 46% error gap on words exercising the group relation.

Authors:Shin'ya Yamaguchi, Daiki Chijiwa, Tamao Sakao, Taku Hasegawa
Title: Parallel In-context Learning for Large Vision Language Models
Abstract:
Large vision‑language models (LVLMs) employ multi‑modal in‑context learning (MM‑ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade‑off, we propose Parallel In‑Context Learning (Parallel‑ICL), a plug‑and‑play inference algorithm. Parallel‑ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product‑of‑Experts (PoE) ensemble to approximate the full‑context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel‑ICL: (i) clustering‑based context chunking to maximize inter‑chunk diversity and (ii) similarity‑based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel‑ICL achieves performance comparable to full‑context MM‑ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy‑efficiency trade‑off in MM‑ICL, enabling dynamic task adaptation with substantially reduced inference overhead.

Authors:Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan
Title: MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models
Abstract:
Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub‑tokens and models the diffusion process at the sub‑token level. We identify two limitations of the MDM‑Prime framework. First, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. Second, we find that the function form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte‑Pair‑Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM‑Prime and develop MDM‑Prime‑v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM‑Prime‑v2 is 21.8× more compute‑efficient than autoregressive models (ARM). In compute‑optimal comparisons, MDM‑Prime‑v2 achieves 7.77 perplexity on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM‑Prime (13.41). When extending the model size to 1.1B parameters, our model further demonstrates superior zero‑shot accuracy on various commonsense reasoning tasks.

Authors:Yiqun T. Chen, Moran Guo, Shengy Li
Title: Power Analysis for Prediction-Powered Inference
Abstract:
Modern studies increasingly leverage outcomes predicted by machine learning and artificial intelligence (AI/ML) models, and recent work, such as prediction‑powered inference (PPI), has developed valid downstream statistical inference procedures. However, classical power and sample size formulas do not readily account for these predictions. In this work, we tackle a simple yet practical question: given a new AI/ML model with high predictive power, how many labeled samples are needed to achieve a desired level of statistical power? We derive closed‑form power formulas by characterizing the asymptotic variance of the PPI estimator and applying Wald test inversion to obtain the required labeled sample size. Our results cover widely used settings including two‑sample comparisons and risk measures in 2x2 tables. We find that a useful rule of thumb is that the reduction in required labeled samples relative to classical designs scales roughly with the R2 between the predictions and the ground truth. Our analytical formulas are validated using Monte Carlo simulations, and we illustrate the framework in three contemporary biomedical applications spanning single‑cell transcriptomics, clinical blood pressure measurement, and dermoscopy imaging. We provide our software as an R package and online calculators at https://github.com/yiqunchen/pppower.

Authors:Yifan Zhang
Title: Residual Stream Duality in Modern Transformer Architectures
Abstract:
Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two‑axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self‑attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth‑wise residual attention read is exactly the same local operator as causal short sliding‑window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer^2. This perspective also clarifies the recent literature. ELC‑BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention‑based routing over earlier layers. The key point, however, is that operator‑level duality does not imply systems‑level symmetry. For large‑scale autoregressive models, sequence‑axis ShortSWA is usually the more hardware‑friendly placement because it reuses token‑side sliding‑window kernels, KV‑cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross‑layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence‑axis ShortSWA when the goal is local adaptive mixing.

Authors:Sijie Li, Biao Qian, Jungong Han
Title: Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
Abstract:
Network pruning is an effective technique for enabling lightweight Large Vision‑Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality‑specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text‑Visual Weight Pruning method for LVLMs, dubbed ATV‑Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV‑Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer‑adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV‑Pruning over state‑of‑the‑art methods.

Authors:Xiaolong Han, Ferrante Neri, Zijian Jiang, Fang Wu, Yanfang Ye, Lu Yin, Zehong Wang
Title: W2T: LoRA Weights Already Know What They Can Do
Abstract:
Each LoRA checkpoint compactly stores task‑specific updates in low‑rank weight matrices, offering an efficient way to adapt large language models to new tasks and domains. In principle, these weights already encode what the adapter does and how well it performs. In this paper, we ask whether this information can be read directly from the weights, without running the base model or accessing training data. A key obstacle is that a single LoRA update can be factorized in infinitely many ways. Without resolving this ambiguity, models trained on the factors may fit the particular factorization rather than the underlying update. To this end, we propose \methodfull, which maps each LoRA update to a provably canonical form via QR decomposition followed by SVD, so that all equivalent factorizations share the same representation. The resulting components are then tokenized and processed by a Transformer to produce a weight‑space embedding. Across language and vision LoRA collections, W2T achieves strong results on attribute classification, performance prediction, and adapter retrieval, demonstrating that LoRA weights reliably indicate model behavior once factorization ambiguity is removed. Code is available at https://github.com/xiaolonghan2000/Weight2Token.

Authors:Rushil Thareja, Gautam Gupta, Francesco Pinto, Nils Lukas
Title: MAC: Multi-Agent Constitution Learning
Abstract:
Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM‑based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi‑Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human‑readable and auditable rule sets, and achieves performance comparable to supervised fine‑tuning and GRPO without requiring parameter updates.

Authors:Max Zimmer, Nico Pelleriti, Christophe Roux, Sebastian Pokutta
Title: The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning
Abstract:
AI tools and agents are reshaping how researchers work, from proving theorems to training neural networks. Yet for many, it remains unclear how these tools fit into everyday research practice. This paper is a practical guide to AI‑assisted research in mathematics and machine learning: We discuss how researchers can use modern AI systems productively, where these systems help most, and what kinds of guardrails are needed to use them responsibly. It is organized into three parts: (I) a five‑level taxonomy of AI integration, (II) an open‑source framework that, through a set of methodological rules formulated as agent prompts, turns CLI coding agents (e.g., Claude Code, Codex CLI, OpenCode) into autonomous research assistants, and (III) case studies from deep learning and mathematics. The framework runs inside a sandboxed container, works with any frontier LLM through existing CLI agents, is simple enough to install and use within minutes, and scales from personal‑laptop prototyping to multi‑node, multi‑GPU experimentation across compute clusters. In practice, our longest autonomous session ran for over 20 hours, dispatching independent experiments across multiple nodes without human intervention. We stress that our framework is not intended to replace the researcher in the loop, but to augment them. Our code is publicly available at https://github.com/ZIB‑IOL/The‑Agentic‑Researcher.

Authors:Dibakar Sigdel, Namuna Panday
Title: PhasorFlow: A Python Library for Unit Circle Based Computing
Abstract:
We present PhasorFlow, an open‑source Python library introducing a computational paradigm operating on the S^1 unit circle. Inputs are encoded as complex phasors z = e^iθ on the N‑Torus (\mathbbT^N). As computation proceeds via unitary wave interference gates, global norm is preserved while individual components drift into \mathbbC^N, allowing algorithms to natively leverage continuous geometric gradients for predictive learning. PhasorFlow provides three core contributions. First, we formalize the Phasor Circuit model (N unit circle threads, M gates) and introduce a 22‑gate library covering Standard Unitary, Non‑Linear, Neuromorphic, and Encoding operations with full matrix algebra simulation. Second, we present the Variational Phasor Circuit (VPC), analogous to Variational Quantum Circuits (VQC), enabling optimization of continuous phase parameters for classical machine learning tasks. Third, we introduce the Phasor Transformer, replacing expensive QK^TV attention with a parameter‑free, DFT‑based token mixing layer inspired by FNet. We validate PhasorFlow on non‑linear spatial classification, time‑series prediction, financial volatility detection, and neuromorphic tasks including neural binding and oscillatory associative memory. Our results establish unit circle computing as a deterministic, lightweight, and mathematically principled alternative to classical neural networks and quantum circuits. It operates on classical hardware while sharing quantum mechanics' unitary foundations. PhasorFlow is available at https://github.com/mindverse‑computing/phasorflow.

Authors:Jakaria Rabbi, Nilanjan Ray, Dana Cobzas
Title: Self-supervised Disentanglement of Disease Effects from Aging in 3D Medical Shapes
Abstract:
Disentangling pathological changes from physiological aging in 3D medical shapes is crucial for developing interpretable biomarkers and patient stratification. However, this separation is challenging when diagnosis labels are limited or unavailable, since disease and aging often produce overlapping effects on shape changes, obscuring clinically relevant shape patterns. To address this challenge, we propose a two‑stage framework combining unsupervised disease discovery with self‑supervised disentanglement of implicit shape representations. In the first stage, we train an implicit neural model with signed distance functions to learn stable shape embeddings. We then apply clustering on the shape latent space, which yields pseudo disease labels without using ground‑truth diagnosis during discovery. In the second stage, we disentangle factors in a compact variational space using pseudo disease labels discovered in the first stage and the ground truth age labels available for all subjects. We enforce separation and controllability with a multi‑objective disentanglement loss combining covariance and a supervised contrastive loss. On ADNI hippocampus and OAI distal femur shapes, we achieve near‑supervised performance, improving disentanglement and reconstruction over state‑of‑the‑art unsupervised baselines, while enabling high‑fidelity reconstruction, controllable synthesis, and factor‑based explainability. Code and checkpoints are available at https://github.com/anonymous‑submission01/medical‑shape‑disentanglement

Authors:Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang
Title: FlashSampling: Fast and Memory-Efficient Exact Sampling
Abstract:
Sampling from a categorical distribution is mathematically simple, but in large‑vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM‑head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile‑by‑tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because \argmax decomposes over a partition; grouped variants for online and tensor‑parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel‑level decode workloads, and in end‑to‑end vLLM experiments, it reduces time per output token by up to 19% on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth‑bound postprocessing step into a lightweight epilogue. Project Page: https://github.com/FlashSampling/FlashSampling.

Authors:Ye Wang, Zixuan Wu, Lifeng Shen, Jiang Xie, Xiaoling Wang, Hong Yu, Guoyin Wang
Title: Mastering the Minority: An Uncertainty-guided Multi-Expert Framework for Challenging-tailed Sequence Learning
Abstract:
Imbalanced data distribution remains a critical challenge in sequential learning, leading models to easily recognize frequent categories while failing to detect minority classes adequately. The Mixture‑of‑Experts model offers a scalable solution, yet its application is often hindered by parameter inefficiency, poor expert specialization, and difficulty in resolving prediction conflicts. To Master the Minority classes effectively, we propose the Uncertainty‑based Multi‑Expert fusion network (UME) framework. UME is designed with three core innovations: First, we employ Ensemble LoRA for parameter‑efficient modeling, significantly reducing the trainable parameter count. Second, we introduce Sequential Specialization guided by Dempster‑Shafer Theory (DST), which ensures effective specialization on the challenging‑tailed classes. Finally, an Uncertainty‑Guided Fusion mechanism uses DST's certainty measures to dynamically weigh expert opinions, resolving conflicts by prioritizing the most confident expert for reliable final predictions. Extensive experiments across four public hierarchical text classification datasets demonstrate that UME achieves state‑of‑the‑art performance. We achieve a performance gain of up to 17.97% over the best baseline on individual categories, while reducing trainable parameters by up to 10.32%. The findings highlight that uncertainty‑guided expert coordination is a principled strategy for addressing challenging‑tailed sequence learning. Our code is available at https://github.com/CQUPTWZX/Multi‑experts.

Authors:Paras Sharma, Swastika Sharma
Title: Flood Risk Follows Valleys, Not Grids: Graph Neural Networks for Flash Flood Susceptibility Mapping in Himachal Pradesh with Conformal Uncertainty Quantification
Abstract:
Flash floods are the most destructive natural hazard in Himachal Pradesh (HP), India, causing over 400 fatalities and 1.2 billion in losses in the 2023 monsoon season alone. Existing risk maps treat every pixel independently, ignoring the basic fact that flooding upstream raises risk downstream. We address this with a Graph Neural Network (GraphSAGE) trained on a watershed connectivity graph (460 sub‑watersheds, 1,700 directed edges), built from a six‑year Sentinel‑1 SAR flood inventory (2018‑2023, 3,000 events) and 12 environmental variables at 30 m resolution. Four pixel‑based ML models (RF, XGBoost, LightGBM, stacking ensemble) serve as baselines. All models are evaluated with leave‑one‑basin‑out spatial cross‑validation to avoid the 5‑15% AUC inflation of random splits. Conformal prediction produces the first HP susceptibility maps with statistically guaranteed 90% coverage intervals. The GNN achieved AUC = 0.978 +/‑ 0.017, outperforming the best baseline (AUC = 0.881) and the published HP benchmark (AUC = 0.88). The +0.097 gain confirms that river connectivity carries predictive signal that pixel‑based models miss. High‑susceptibility zones overlap 1,457 km of highways (including 217 km of the Manali‑Leh corridor), 2,759 bridges, and 4 major hydroelectric installations. Conformal intervals achieved 82.9% empirical coverage on the held‑out 2023 test set; lower coverage in high‑risk zones (45‑59%) points to SAR label noise as a target for future work.

Authors:Zeyu Zhang, Rui Li, Xiaoyan Zhao, Yang Zhang, Wenjie Wang, Xu Chen, Tat-Seng Chua
Title: NextMem: Towards Latent Factual Memory for LLM-based Agents
Abstract:
Memory is critical for LLM‑based agents to preserve past observations for future decision‑making, where factual memory serves as its foundational part. However, existing approaches to constructing factual memory face several limitations. Textual methods impose heavy context and indexing burdens, while parametric methods suffer from catastrophic forgetting and high costs. To address these challenges, we introduce NextMem, a latent factual memory framework that utilizes an autoregressive autoencoder to efficiently construct latent memory while ensuring accurate reconstruction. For better optimization, we propose a two‑stage training process, including autoregressive reconstruction alignment and progressive latent substitution. We also incorporate quantization to reduce storage overhead. Extensive experiments demonstrate that NextMem achieves superior performance, and excels in retrieval, robustness, and extensibility properties. We release our code and model checkpoints at https://github.com/nuster1128/NextMem.

Authors:Pengjun Fang, Yingqing He, Yazhou Xing, Qifeng Chen, Ser-Nam Lim, Harry Yang
Title: AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
Abstract:
Existing video‑to‑audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro‑acoustic features. These bottlenecks make it difficult to perform fine‑grained sound synthesis using text‑controlled modes. To address these limitations, we propose AC‑Foley, an audio‑conditioned V2A model that directly leverages reference audio to achieve precise and fine‑grained control over generated sounds. This approach enables fine‑grained sound synthesis, timbre transfer, zero‑shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC‑Foley achieves state‑of‑the‑art performance for Foley generation when conditioned on reference audio, while remaining competitive with state‑of‑the‑art video‑to‑audio methods even without audio conditioning. Code and demo are available at: https://ff2416.github.io/AC‑Foley‑Page

Authors:Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa
Title: Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Abstract:
Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image‑based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero‑shot approaches, which avoid synthetic data and instead score content against real‑data statistics, enabling training‑free, model‑agnostic detection. We introduce STALL, a simple, training‑free, theoretically justified detector that provides likelihood‑based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state‑of‑the‑art generative models. STALL consistently outperforms prior image‑ and video‑based baselines. Code and data are available at https://omerbenhayun.github.io/stall‑video.

Authors:Ziqing Ma, Kai Ying, Xinyue Gu, Tian Zhou, Tianyu Zhu, Haifan Zhang, Peisong Niu, Wang Zheng, Cong Bai, Liang Sun
Title: Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting
Abstract:
Accurate day‑ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challenging due to the pronounced diurnal cycle and inherently complex cloud dynamics. Current methods either lack fine‑scale resolution (e.g., numerical weather prediction, weather foundation models) or degrade at longer lead times (e.g., satellite extrapolation). We propose Baguan‑solar, a two‑stage multimodal framework that fuses forecasts from Baguan, a global weather foundation model, with high‑resolution geostationary satellite imagery to produce 24‑ hour irradiance forecasts at kilometer scale. Its decoupled two‑stage design first forecasts day‑night continuous intermediates (e.g., cloud cover) and then infers irradiance, while its modality fusion jointly preserves fine‑scale cloud structures from satellite and large‑scale constraints from Baguan forecasts. Evaluated over East Asia using CLDAS as ground truth, Baguan‑solar outperforms strong baselines (including ECMWF IFS, vanilla Baguan, and SolarSeer), reducing RMSE by 16.08% and better resolving cloud‑induced transients. An operational deployment of Baguan‑solar has supported solar power forecasting in an eastern province in China, since July 2025. Our code is accessible at https://github.com/DAMO‑DI‑ML/Baguansolar. git.

Authors:Tuan-Anh Yang, Bao V. Q. Bui, Chanh-Quang Vo-Van, Truong-Son Hy
Title: Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis
Abstract:
We propose a deep learning framework for COVID‑19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice‑level and volumetric information. The 2.5D branch processes multi‑view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet‑18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross‑source robustness. Predictions from both branches are combined through logit‑level ensemble inference. Experiments on the PHAROS‑AIF‑MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID‑19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1‑score, outperforming both individual models, while for multi‑class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1‑score. These results highlight the benefit of combining pretrained slice‑based representations with volumetric modeling for robust multi‑source medical imaging analysis. Code is available at https://github.com/HySonLab/PHAROS‑AIF‑MIH

Authors:Xuanfei Ren, Allen Nie, Tengyang Xie, Ching-An Cheng
Title: POLCA: Stochastic Generative Optimization with LLM
Abstract:
Optimizing complex systems, ranging from LLM prompts to multi‑turn agents, traditionally requires labor‑intensive manual iteration. We formalize this challenge as a stochastic generative optimization problem where a generative language model acts as the optimizer, guided by numerical rewards and text feedback to discover the best system. We introduce Prioritized Optimization with Local Contextual Aggregation (POLCA), a scalable framework designed to handle stochasticity in optimization ‑‑ such as noisy feedback, sampling minibatches, and stochastic system behaviors ‑‑ while effectively managing the unconstrained expansion of solution space. POLCA maintains a priority queue to manage the exploration‑exploitation tradeoff, systematically tracking candidate solutions and their evaluation histories. To enhance efficiency, we integrate an \varepsilon‑Net mechanism to maintain parameter diversity and an LLM Summarizer to perform meta‑learning across historical trials. We theoretically prove that POLCA converges to near‑optimal candidate solutions under stochasticity. We evaluate our framework on diverse benchmarks, including τ‑bench, HotpotQA (agent optimization), VeriBench (code translation) and KernelBench (CUDA kernel generation). Experimental results demonstrate that POLCA achieves robust, sample and time‑efficient performance, consistently outperforming state‑of‑the‑art algorithms in both deterministic and stochastic problems. The codebase for this work is publicly available at https://github.com/rlx‑lab/POLCA.

Authors:Yu Hao, Qiuyu Wang, Cheng Yang, Yawen Li, Zhiqiang Zhang, Chuan Shi
Title: GNNVerifier: Graph-based Verifier for LLM Task Planning
Abstract:
Large language models (LLMs) facilitate the development of autonomous agents. As a core component of such agents, task planning aims to decompose complex natural language requests into concrete, solvable sub‑tasks. Since LLM‑generated plans are frequently prone to hallucinations and sensitive to long‑context prom‑pts, recent research has introduced plan verifiers to identify and correct potential flaws. However, most existing approaches still rely on an LLM as the verifier via additional prompting for plan review or self‑reflection. LLM‑based verifiers can be misled by plausible narration and struggle to detect failures caused by structural relations across steps, such as type mismatches, missing intermediates, or broken dependencies. To address these limitations, we propose a graph‑based verifier for LLM task planning. Specifically, the proposed method has four major components: Firstly, we represent a plan as a directed graph with enriched attributes, where nodes denote sub‑tasks and edges encode execution order and dependency constraints. Secondly, a graph neural network (GNN) then performs structural evaluation and diagnosis, producing a graph‑level plausibility score for plan acceptance as well as node/edge‑level risk scores to localize erroneous regions. Thirdly, we construct controllable perturbations from ground truth plan graphs, and automatically generate training data with fine‑grained annotations. Finally, guided by the feedback from our GNN verifier, we enable an LLM to conduct local edits (e.g., tool replacement or insertion) to correct the plan when the graph‑level score is insufficient. Extensive experiments across diverse datasets, backbone LLMs, and planners demonstrate that our GNNVerifier achieves significant gains in improving plan quality. Our data and code is available at https://github.com/BUPT‑GAMMA/GNNVerifier.

Authors:Seunghan Lee, Jaehoon Lee, Jun Seo, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Title: Cross-RAG: Zero-Shot Retrieval-Augmented Time Series Forecasting via Cross-Attention
Abstract:
Recent advances in time series foundation models (TSFMs) demonstrate strong expressive capacity through large‑scale pretraining across diverse time series domains. Zero‑shot time series forecasting with TSFMs, however, exhibits limited generalization to unseen datasets, which retrieval‑augmented forecasting addresses by leveraging an external knowledge base. Existing approaches rely on a fixed number of retrieved samples that may introduce irrelevant information. To this end, we propose Cross‑RAG, a zero‑shot retrieval‑augmented forecasting framework that selectively attends to query‑relevant retrieved samples. Cross‑RAG models input‑level relevance between the query and retrieved samples via query‑retrieval cross‑attention, while jointly incorporating information from the query and retrieved samples. Extensive experiments demonstrate that Cross‑RAG consistently improves zero‑shot forecasting performance across various TSFMs and RAG methods, and additional analyses confirm its effectiveness across diverse retrieval scenarios. Code is available at https://github.com/seunghan96/cross‑rag/.

Authors:Salim Khazem
Title: AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers
Abstract:
Frozen‑backbone transfer with Vision Transformers faces two under‑addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low‑rank bottleneck whose up‑projection is zero‑initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminates early‑epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess‑risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an ``elbow'' behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi‑seed reporting throughout. On a core 5 dataset transfer suite, AdapterTune improves top‑1 accuracy over head‑only transfer by +14.9 points on average while training only 0.92 of the parameters required by full fine‑tuning, and outperforms full fine‑tuning on 10 of 15 dataset‑backbone pairs. Across the full benchmark, AdapterTune improves over head‑only transfer on every dataset‑backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: https://github.com/salimkhazem/adaptertune

Authors:Ping Chen, Xiang Liu, Xingpeng Zhang, Fei Shen, Xun Gong, Zhaoxiang Liu, Zezhou Chen, Huan Hu, Kai Wang, Shiguo Lian
Title: Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning
Abstract:
Diffusion models operate in a reflexive System 1 mode, constrained by a fixed, content‑agnostic sampling schedule. This rigidity arises from the curse of state dimensionality, where the combinatorial explosion of possible states in the high‑dimensional noise manifold renders explicit trajectory planning intractable and leads to systematic computational misallocation. To address this, we introduce Chain‑of‑Trajectories (CoTj), a train‑free framework enabling System 2 deliberative planning. Central to CoTj is Diffusion DNA, a low‑dimensional signature that quantifies per‑stage denoising difficulty and serves as a proxy for the high‑dimensional state space, allowing us to reformulate sampling as graph planning on a directed acyclic graph. Through a Predict‑Plan‑Execute paradigm, CoTj dynamically allocates computational effort to the most challenging generative phases. Experiments across multiple generative models demonstrate that CoTj discovers context‑aware trajectories, improving output quality and stability while reducing redundant computation. This work establishes a new foundation for resource‑aware, planning‑based diffusion modeling. The code is available at https://github.com/UnicomAI/CoTj.

Authors:Mike Amega
Title: EARCP: Self-Regulating Coherence-Aware Ensemble Architecture for Sequential Decision Making -- Ensemble Auto-Regule par Coherence et Performance
Abstract:
We present EARCP (Ensemble Auto‑Régulé par Cohérence et Performance), a novel ensemble architecture that dynamically weights heterogeneous expert models based on both their individual performance and inter‑model coherence. Unlike traditional ensemble methods that rely on static or offline‑learned combinations, EARCP continuously adapts model weights through a principled online learning mechanism that balances exploitation of high‑performing models with exploration guided by consensus signals. The architecture combines theoretical foundations from multiplicative weight update algorithms with a novel coherence‑based regularization term, providing both theoretical guarantees through regret bounds and practical robustness in non‑stationary environments. We formalize the EARCP framework, prove sublinear regret bounds of O(sqrt(T log M)) under standard assumptions, and demonstrate its effectiveness through empirical evaluation on sequential prediction tasks including time series forecasting, activity recognition, and financial prediction. The architecture is designed as a general‑purpose framework applicable to any domain requiring ensemble learning with temporal dependencies. An open‑source implementation is available at https://github.com/Volgat/earcp and via PyPI (pip install earcp).

Authors:Varun Pratap Bhardwaj
Title: SuperLocalMemory V3: Information-Geometric Foundations for Zero-LLM Enterprise Agent Memory
Abstract:
Persistent memory is a central capability for AI agents, yet the mathematical foundations of memory retrieval, lifecycle management, and consistency remain unexplored. Current systems employ cosine similarity for retrieval, heuristic decay for salience, and provide no formal contradiction detection. We establish information‑geometric foundations through three contributions. First, a retrieval metric derived from the Fisher information structure of diagonal Gaussian families, satisfying Riemannian metric axioms, invariant under sufficient statistics, and computable in O(d) time. Second, memory lifecycle formulated as Riemannian Langevin dynamics with proven existence and uniqueness of the stationary distribution via the Fokker‑Planck equation, replacing hand‑tuned decay with principled convergence guarantees. Third, a cellular sheaf model where non‑trivial first cohomology classes correspond precisely to irreconcilable contradictions across memory contexts. On the LoCoMo benchmark, the mathematical layers yield +12.7 percentage points over engineering baselines across six conversations, reaching +19.9 pp on the most challenging dialogues. A four‑channel retrieval architecture achieves 75% accuracy without cloud dependency. Cloud‑augmented results reach 87.7%. A zero‑LLM configuration satisfies EU AI Act data sovereignty requirements by architectural design. To our knowledge, this is the first work establishing information‑geometric, sheaf‑theoretic, and stochastic‑dynamical foundations for AI agent memory systems.

Authors:Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng
Title: From $\boldsymbol{\logπ}$ to $\boldsymbolπ$: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via ``hard clipping'', which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent ``soft clipping'' methods attempt to recover these gradients, they suffer from a critical challenge: relying on log‑probability gradient (\nabla_θ\log π_θ) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing probability gradient (\nabla_θπ_θ) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek‑R1‑Distill‑Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/VenomRose‑Juri/DGPO‑RL.

Authors:Jaeyo Shin, Jiwook Kim, Hyunjung Shim
Title: Representation Alignment for Just Image Transformers is not Easier than You Think
Abstract:
Representation Alignment (REPA) has emerged as a simple way to accelerate Diffusion Transformers training in latent space. At the same time, pixel‑space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove a dependency on a pretrained tokenizer, and then avoid the reconstruction bottleneck of latent diffusion. This paper shows that the REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT‑B/16 and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet 256 × 256, while achieving > 2× faster convergence. Finally, PixelREPA‑H/16 achieves FID=1.81 and IS=317.2. Our code is available at https://github.com/kaist‑cvml/PixelREPA.

Authors:Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon, Ki Seong Lee, Youngchae Lee, Muhan Yeo, Edward Choi
Title: ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
Abstract:
While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform actual step‑by‑step reasoning or just rely on superficial visual cues. To investigate this, we introduce ECG‑Reasoning‑Benchmark, a novel multi‑turn evaluation framework comprising over 6,400 samples to systematically assess step‑by‑step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state‑of‑the‑art models reveals a critical failure in executing multi‑step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near‑zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings to the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the necessity for robust, reasoning‑centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg‑reasoning‑benchmark.

Authors:He Zhang, Ying Sun, Hui Xiong
Title: GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies
Abstract:
Flow‑matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi‑modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one‑step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor that presents significant untapped potential. This overlooked factor, along with the challenge of controlling policy stochasticity, constitutes two critical areas for advancing distilled flow‑matching policies. To overcome these limitations, we propose GoldenStart (GSFlow), a policy distillation method with Q‑guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q‑guided prior modeled by a conditional VAE. This state‑conditioned prior repositions the starting points of the one‑step generation process into high‑Q regions, effectively providing a "golden start" that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative startpoint and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging the generative models and the practical actor‑critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state‑of‑the‑art approaches. Code will be available at https://github.com/ZhHe11/GSFlow‑RL.

Authors:Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu
Title: QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis
Abstract:
SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general‑purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high‑quality real‑world SVA corpora and the lack of reliable methods to determine NL‑SVA semantic equivalence. For the former, large‑scale open‑source RTLs are used to guide LLMs to generate real‑world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV‑SVA, a series of SVA generation models. Notably, CodeV‑SVA‑14B achieves 75.8% on NL2SVA‑Human and 84.0% on NL2SVA‑Machine in Func.@1, matching or exceeding advanced LLMs like GPT‑5 and DeepSeek‑R1.

Authors:Huan Wang, Jun Shen, Jun Yan, Guansong Pang
Title: Domain-Skewed Federated Learning with Feature Decoupling and Calibration
Abstract:
Federated learning (FL) allows distributed clients to collaboratively train a global model in a privacy‑preserving manner. However, one major challenge is domain skew, where clients' data originating from diverse domains may hinder the aggregated global model from learning a consistent representation space, resulting in poor generalizable ability in multiple domains. In this paper, we argue that the domain skew is reflected in the domain‑specific biased features of each client, causing the local model's representations to collapse into a narrow low‑dimensional subspace. We then propose Federated Feature Decoupling and Calibration (F^2DC), which liberates valuable class‑relevant information by calibrating the domain‑specific biased features, enabling more consistent representations across domains. A novel component, Domain Feature Decoupler (DFD), is first introduced in F^2DC to determine the robustness of each feature unit, thereby separating the local features into domain‑robust features and domain‑related features. A Domain Feature Corrector (DFC) is further proposed to calibrate these domain‑related features by explicitly linking discriminative signals, capturing additional class‑relevant clues that complement the domain‑robust features. Finally, a domain‑aware aggregation of the local models is performed to promote consensus among clients. Empirical results on three popular multi‑domain datasets demonstrate the effectiveness of the proposed F^2DC and the contributions of its two modules. Code is available at https://github.com/mala‑lab/F2DC.

Authors:Shahriar Kabir, Abdullah Muhammed Amimul Ehsan, Istiak Ahmmed Rifti, Md Kaykobad Reza
Title: DualSwinFusionSeg: Multimodal Martian Landslide Segmentation via Dual Swin Transformer with Multi-Scale Fusion and UNet++
Abstract:
Automated segmentation of Martian landslides, particularly in tectonically active regions such as Valles Marineris,is important for planetary geology, hazard assessment, and future robotic exploration. However, detecting landslides from planetary imagery is challenging due to the heterogeneous nature of available sensing modalities and the limited number of labeled samples. Each observation combines RGB imagery with geophysical measurements such as digital elevation models, slope maps, thermal inertia, and contextual grayscale imagery, which differ significantly in resolution and statistical properties. To address these challenges, we propose DualSwinFusionSeg, a multimodal segmentation architecture that separates modality‑specific feature extraction and performs multi‑scale cross‑modal fusion. The model employs two parallel Swin Transformer V2 encoders to independently process RGB and auxiliary geophysical inputs, producing hierarchical feature representations. Corresponding features from the two streams are fused at multiple scales and decoded using a UNet++ decoder with dense nested skip connections to preserve fine boundary details. Extensive ablation studies evaluate modality contributions, loss functions, decoder architectures, and fusion strategies. Experiments on the MMLSv2 dataset from the PBVS 2026 Mars‑LS Challenge show that modality‑specific encoders and simple concatenation‑based fusion improve segmentation accuracy under limited training data. The final model achieves 0.867 mIoU and 0.905 F1 on the development benchmark and 0.783 mIoU on the held‑out test set, demonstrating strong performance for multimodal planetary surface segmentation.

Authors:Hamza Mahmood, Abhishek Halder, Adeel Akhtar
Title: Schrödinger Bridge Over A Compact Connected Lie Group
Abstract:
This work studies the Schrödinger bridge problem for the kinematic equation on a compact connected Lie group. The objective is to steer a controlled diffusion between given initial and terminal densities supported over the Lie group while minimizing the control effort. We develop a coordinate‑free formulation of this stochastic optimal control problem that respects the underlying geometric structure of the Lie group, thereby avoiding limitations associated with local parameterizations or embeddings in Euclidean spaces. We establish the existence and uniqueness of solution to the corresponding Schrödinger system. Our results are constructive in that they derive a geometric controller that optimally interpolates probability densities supported over the Lie group. To illustrate the results, we provide numerical examples on \mathsfSO(2) and \mathsfSO(3). The codes and animations are publicly available at https://github.com/gradslab/SbpLieGroups.git .

Authors:N. Brag
Title: Benchmarking Open-Source PPG Foundation Models for Biological Age Prediction
Abstract:
A task‑specific model trained on 212,231 UK Biobank subjects to predict vascular age from PPG (AI‑PPG Age) fails on a different clinical population: predictions collapse to a narrow 38‑67 year range regardless of true age. Meanwhile, a general‑purpose foundation model with no age‑related training objective achieves lower error on the same data. We investigate why this happens and what it means for PPG‑based biological age prediction. We evaluate three open‑source PPG models (Pulse‑PPG, PaPaGei‑S, AI‑PPG Age) on 906 surgical patients from PulseDB, using frozen embeddings with Ridge regression and 5‑fold cross‑validation. Pulse‑PPG reaches MAE = 9.28 years, beating both AI‑PPG Age in linear probe mode (9.72) and HR/HRV combined with demographics (9.59). Adding demographic features brings the best result down to MAE = 8.22 years (R2 = 0.517, r = 0.725). The predicted age gap correlates with diastolic blood pressure after adjusting for chronological age (r = ‑0.188, p = 1.2e‑8), consistent with what Apple reported for their proprietary PpgAge model. The remaining gap with Apple (MAE 2.43) appears driven by dataset size (906 vs 213,593 subjects) and population differences rather than model architecture, as our learning curve shows no plateau. Code is publicly available.

Authors:Gwanwoo Song, Kwanyoung Park, Youngwoon Lee
Title: Chunk-Guided Q-Learning
Abstract:
In offline reinforcement learning (RL), single‑step temporal‑difference (TD) learning can suffer from bootstrapping error accumulation over long horizons. Action‑chunked TD methods mitigate this by backing up over multiple steps, but can introduce suboptimality by restricting the policy class to open‑loop action sequences. To resolve this trade‑off, we present Chunk‑Guided Q‑Learning (CGQ), a single‑step TD algorithm that guides a fine‑grained single‑step critic by regularizing it toward a chunk‑based critic trained using temporally extended backups. This reduces compounding error while preserving fine‑grained value propagation. We theoretically show that CGQ attains tighter critic optimality bounds than either single‑step or action‑chunked TD learning alone. Empirically, CGQ achieves strong performance on challenging long‑horizon OGBench tasks, often outperforming both single‑step and action‑chunked methods.

Authors:Kursat Komurcu, Linas Petkevicius
Title: Sat-JEPA-Diff: Bridging Self-Supervised Learning and Generative Diffusion for Remote Sensing
Abstract:
Predicting satellite imagery requires a balance between structural accuracy and textural detail. Standard deterministic methods like PredRNN or SimVP minimize pixel‑based errors but suffer from the "regression to the mean" problem, producing blurry outputs that obscure subtle geographic‑spatial features. Generative models provide realistic textures but often misleadingly reveal structural anomalies. To bridge this gap, we introduce Sat‑JEPA‑Diff, which combines Self‑Supervised Learning (SSL) with Hidden Diffusion Models (LDM). An IJEPA module predicts stable semantic representations, which then route a frozen Stable Diffusion backbone via a lightweight cross‑attention adapter. This ensures that the synthesized high‑accuracy textures are based on absolutely accurate structural predictions. Evaluated on a global Sentinel‑2 dataset, Sat‑JEPA‑Diff excels at resolving sharp boundaries. It achieves leading perceptual scores (GSSIM: 0.8984, FID: 0.1475) and significantly outperforms deterministic baselines, despite standard autoregressive stability limits. The code and dataset are publicly available on https://github.com/VU‑AIML/SAT‑JEPA‑DIFF.

Authors:Shivnath Tathe
Title: True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity
Abstract:
Low‑precision neural network training has emerged as a promising direction for reducing computational costs and democratizing access to deep learning research. However, existing 4‑bit quantization methods either rely on expensive GPU infrastructure or suffer from significant accuracy degradation. In this work, we present a practical method for training convolutional neural networks at true 4‑bit precision using standard PyTorch operations on commodity CPUs. We introduce a novel tanh‑based soft weight clipping technique that, combined with symmetric quantization, dynamic per‑layer scaling, and straight‑through estimators, achieves stable convergence and competitive accuracy. Training a VGG‑style architecture with 3.25 million parameters from scratch on CIFAR‑10, our method achieves 92.34% test accuracy on Google Colab's free CPU tier ‑‑ matching full‑precision baseline performance (92.5%) with only a 0.16% gap. We further validate on CIFAR‑100, achieving 70.94% test accuracy across 100 classes with the same architecture and training procedure, demonstrating that 4‑bit training from scratch generalizes to harder classification tasks. Both experiments achieve 8x memory compression over FP32 while maintaining exactly 15 unique weight values per layer throughout training. We additionally validate hardware independence by demonstrating rapid convergence on a consumer mobile device (OnePlus 9R), achieving 83.16% accuracy in only 6 epochs. To the best of our knowledge, no prior work has demonstrated 4‑bit quantization‑aware training achieving full‑precision parity on standard CPU hardware without specialized kernels or post‑training quantization.

Authors:Jiahao Qin
Title: Collapse or Preserve: Data-Dependent Temporal Aggregation for Spiking Neural Network Acceleration
Abstract:
Spike sparsity is widely believed to enable efficient spiking neural network (SNN) inference on GPU hardware. We demonstrate this is an illusion: five distinct sparse computation strategies on Apple M3 Max all fail to outperform dense convolution, because SIMD architectures cannot exploit the fine‑grained, unstructured sparsity of i.i.d. binary spikes. Instead, we propose Temporal Aggregated Convolution (TAC), which exploits convolution linearity to pre‑aggregate K spike frames before a single convolution call, reducing T calls to T/K. On rate‑coded data, TAC achieves 13.8times speedup with +1.6% accuracy on MNIST and +5.4% on Fashion‑MNIST ‑‑ a simultaneous improvement in both speed and accuracy. However, on event‑based data where the temporal dimension carries genuine motion information, TAC's temporal collapse is harmful. We therefore introduce TAC‑TP (Temporal Preservation), which shares each group's convolution output across K independent LIF steps, preserving full temporal resolution for downstream layers. On DVS128‑Gesture, TAC‑TP achieves 95.1% accuracy (vs. 96.3% baseline) with 50% fewer convolution calls, while standard TAC drops to 91.3%. Our key finding is that the optimal temporal aggregation strategy is data‑dependent: collapse the temporal dimension for rate‑coded data (noise reduction) but preserve it for event data (information retention). Speedup is hardware‑agnostic: TAC achieves 11.0times on NVIDIA V100, confirming the mechanism transfers across GPU architectures. All operators in the mlx‑snn library are open source.

Authors:Dongyuan Li, Ying Zhang, Yaozu Wu, Renhe Jiang
Title: Node Role-Guided LLMs for Dynamic Graph Clustering
Abstract:
Dynamic graph clustering aims to detect and track time‑varying clusters in dynamic graphs, revealing how complex real‑world systems evolve over time. However, existing methods are predominantly black‑box models. They lack interpretability in their clustering decisions and fail to provide semantic explanations of why clusters form or how they evolve, severely limiting their use in safety‑critical domains such as healthcare or transportation. To address these limitations, we propose an end‑to‑end interpretable framework that maps continuous graph embeddings into discrete semantic concepts through learnable prototypes. Specifically, we first decompose node representations into orthogonal role and clustering subspaces, so that nodes with similar roles (e.g., hubs, bridges) but different cluster affiliations can be properly distinguished. We then introduce five node role prototypes (Leader, Contributor, Wanderer, Connector, Newcomer) in the role subspace as semantic anchors, transforming continuous embeddings into discrete concepts to facilitate LLM understanding of node roles within communities. Finally, we design a hierarchical LLM reasoning mechanism to generate both clustering results and natural language explanations, while providing consistency feedback as weak supervision to refine node representations. Experimental results on four synthetic and six real‑world benchmarks demonstrate the effectiveness, interpretability, and robustness of DyG‑RoLLM. Code is available at https://github.com/Clearloveyuan/DyG‑RoLLM.

Authors:Xuan Cui, Huiyue Li, Run Zeng, Yunfei Zhao, Jinrui Qian, Wei Duan, Bo Liu, Zhanpeng Zhou
Title: IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring
Abstract:
As large language models (LLMs) scale to billions of parameters, full‑parameter fine‑tuning becomes compute‑ and memory‑prohibitive. Parameter‑efficient fine‑tuning (PEFT) mitigates this issue by updating only a small set of task‑specific parameters while keeping the base model frozen. Among PEFT approaches, low‑rank adaptation (LoRA) is widely adopted; however, it enforces a uniform rank across layers despite substantial variation in layer importance, motivating layerwise rank allocation. Recent adaptive‑rank variants (e.g., AdaLoRA) allocate ranks based on importance scores, yet typically rely on instantaneous gradients that capture only local sensitivity, overlooking non‑local, pathwise effects within the same layer, which yields unstable and biased scores. To address this limitation, we introduce IGU‑LoRA, an adaptive‑rank LoRA that (i) computes within‑layer Integrated Gradients (IG) sensitivities and aggregates them into a layer‑level score for rank allocation, and (ii) applies an uncertainty‑aware scheme using exponential moving averages with deviation tracking to suppress noisy updates and calibrate rank selection. Theoretically, we prove an upper bound on the composite trapezoidal rule approximation error for parameter‑space IG under a pathwise Hessian‑Lipschitz condition, which informs the quadrature budget. Across diverse tasks and architectures, IGU‑LoRA consistently outperforms strong PEFT baselines at matched parameter budgets, improving downstream accuracy and robustness. Ablations confirm the contributions of pathwise within‑layer sensitivity estimates and uncertainty‑aware selection to effective rank allocation. Our code is publicly available at https://github.com/withyou12/igulora.git

Authors:Grayson Lee, Minh Bui, Shuzi Zhou, Yankai Li, Mo Chen, Ke Li
Title: Implicit Maximum Likelihood Estimation for Real-time Generative Model Predictive Control
Abstract:
Diffusion‑based models have recently shown strong performance in trajectory planning, as they are capable of capturing diverse, multimodal distributions of complex behaviors. A key limitation of these models is their slow inference speed, which results from the iterative denoising process. This makes them less suitable for real‑time applications such as closed‑loop model predictive control (MPC), where plans must be generated quickly and adapted continuously to a changing environment. In this paper, we investigate Implicit Maximum Likelihood Estimation (IMLE) as an alternative generative modeling approach for planning. IMLE offers strong mode coverage while enabling inference that is two orders of magnitude faster, making it particularly well suited for real‑time MPC tasks. Our results demonstrate that IMLE achieves competitive performance on standard offline reinforcement learning benchmarks compared to the standard diffusion‑based planner, while substantially improving planning speed in both open‑loop and closed‑loop settings. We further validate IMLE in a closed‑loop human navigation scenario, operating in real‑time, demonstrating how it enables rapid and adaptive plan generation in dynamic environments.

Authors:Zhaoyuan Gu, Yipu Chen, Zimeng Chai, Alfred Cueva, Thong Nguyen, Yifan Wu, Huishu Xue, Minji Kim, Isaac Legene, Fukang Liu, Matthew Kim, Ayan Barula, Yongxin Chen, Ye Zhao
Title: REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning
Abstract:
Humanoid loco‑manipulation requires coordinated high‑level motion plans with stable, low‑level whole‑body execution under complex robot‑environment dynamics and long‑horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low‑level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high‑dimensional humanoid systems. To address this challenge, we present REFINE‑DP (REinforcement learning FINE‑tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high‑level planner and an RL‑based low‑level loco‑manipulation controller. The DP is fine‑tuned via a PPO‑based diffusion policy gradient to improve task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE‑DP on a humanoid robot performing loco‑manipulation tasks, including door traversal and long‑horizon object transport. REFINE‑DP achieves an over 90% success rate in simulation, even in out‑of‑distribution cases not seen in the pre‑trained data, and enables smooth autonomous task execution in real‑world dynamic environments. Our proposed method substantially outperforms pre‑trained DP baselines and demonstrates that RL fine‑tuning is key to reliable humanoid loco‑manipulation. https://refine‑dp.github.io/REFINE‑DP/

Authors:Dongyuan Li, Shun Zheng, Chang Xu, Jiang Bian, Renhe Jiang
Title: Routing Channel-Patch Dependencies in Time Series Forecasting with Graph Spectral Decomposition
Abstract:
Time series forecasting has attracted significant attention in the field of AI. Previous works have revealed that the Channel‑Independent (CI) strategy improves forecasting performance by modeling each channel individually, but it often suffers from poor generalization and overlooks meaningful inter‑channel interactions. Conversely, Channel‑Dependent (CD) strategies aggregate all channels, which may introduce irrelevant information and lead to oversmoothing. Despite recent progress, few existing methods offer the flexibility to adaptively balance CI and CD strategies in response to varying channel dependencies. To address this, we propose a generic plugin xCPD, that can adaptively model the channel‑patch dependencies from the perspective of graph spectral decomposition. Specifically, xCPD first projects multivariate signals into the frequency domain using a shared graph Fourier basis, and groups patches into low‑, mid‑, and high‑frequency bands based on their spectral energy responses. xCPD then applies a channel‑adaptive routing mechanism that dynamically adjusts the degree of inter‑channel interaction for each patch, enabling selective activation of frequency‑specific experts. This facilitates fine‑grained input‑aware modeling of smooth trends, local fluctuations, and abrupt transitions. xCPD can be seamlessly integrated on top of existing CI and CD forecasting models, consistently enhancing both accuracy and generalization across benchmarks. The code is available https://github.com/Clearloveyuan/xCPD.

Authors:Andrii Shchur, Inna Skarga-Bandurova
Title: MR-GNF: Multi-Resolution Graph Neural Forecasting on Ellipsoidal Meshes for Efficient Regional Weather Prediction
Abstract:
Weather forecasting offers an ideal testbed for artificial intelligence (AI) to learn complex, multi‑scale physical systems. Traditional numerical weather prediction remains computationally costly for frequent regional updates, as high‑resolution nests require intensive boundary coupling. We introduce Multi‑Resolution Graph Neural Forecasting (MR‑GNF), a lightweight, physics‑aware model that performs short‑term regional forecasts directly on an ellipsoidal, multi‑scale graph of the Earth. The framework couples a 0.25° region of interest with a 0.5° context belt and 1.0° outer domain, enabling continuous cross‑scale message passing without explicit nested boundaries. Its axial graph‑attention network alternates vertical self‑attention across pressure levels with horizontal graph attention across surface nodes, capturing implicit 3‑D structure in just 1.6 M parameters. Trained on 40 years of ERA5 reanalysis (1980‑2024), MR‑GNF delivers stable +6 h to +24 h forecasts for near‑surface temperature, wind, and precipitation over the UK‑Ireland sector. Despite a total compute cost below 80 GPU‑hours on a single RTX 6000 Ada, the model matches or exceeds heavier regional AI systems while preserving physical consistency across scales. These results demonstrate that graph‑based neural operators can achieve trustworthy, high‑resolution weather prediction at a fraction of NWP cost, opening a practical path toward AI‑driven early‑warning and renewable‑energy forecasting systems. Project page and code: https://github.com/AndriiShchur/MR‑GNF

Authors:Adrien Corenflos
Title: Robust Automatic Differentiation of Square-Root Kalman Filters via Gramian Differentials
Abstract:
Square‑root Kalman filters propagate state covariances in Cholesky‑factor form for numerical stability, and are a natural target for gradient‑based parameter learning in state‑space models. Their core operation, triangularization of a matrix M \in \mathbbR^n × m, is computed via a QR decomposition in practice, but naively differentiating through it causes two problems: the semi‑orthogonal factor is non‑unique when m > n, yielding undefined gradients; and the standard Jacobian formula involves inverses, which diverges when M is rank‑deficient. Both are resolved by the observation that all filter outputs relevant to learning depend on the input matrix only through the Gramian MM^\top, so the composite loss is smooth in M even where the triangularization is not. We derive a closed‑form chain‑rule directly from the differential of this Gramian identity, prove it exact for the Kalman log‑marginal likelihood and filtered moments, and extend it to rank‑deficient inputs via a two‑component decomposition: a column‑space term based on the Moore‑‑Penrose pseudoinverse, and a null‑space correction for perturbations outside the column space of M.

Authors:Pratik Ramesh, George Stoica, Arun Iyer, Leshem Choshen, Judy Hoffman
Title: Resolving Interference (RI): Disentangling Models for Improved Model Merging
Abstract:
Model merging has shown that multitask models can be created by directly combining the parameters of different models that are each specialized on tasks of interest. However, models trained independently on distinct tasks often exhibit interference that degrades the merged model's performance. To solve this problem, we formally define the notion of Cross‑Task Interference as the drift in the representation of the merged model relative to its constituent models. Reducing cross‑task interference is key to improving merging performance. To address this issue, we propose our method, Resolving Interference (RI), a light‑weight adaptation framework which disentangles expert models to be functionally orthogonal to the space of other tasks, thereby reducing cross‑task interference. RI does this whilst using only unlabeled auxiliary data as input (i.e., no task‑data is needed), allowing it to be applied in data‑scarce scenarios. RI consistently improves the performance of state‑of‑the‑art merging methods by up to 3.8% and generalization to unseen domains by up to 2.3%. We also find RI to be robust to the source of auxiliary input while being significantly less sensitive to tuning of merging hyperparameters. Our codebase is available at: https://github.com/pramesh39/resolving_interference

Authors:Guanyu Chen, Ruichen Wang, Tianren Zhang, Feng Chen
Title: Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding
Abstract:
In‑context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model's inherent in‑weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a task representation space and a sample representation space. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate the effectiveness of our proposed architecture, CoQE, in the single‑value answer setting. It not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across synthetic few‑shot classification and a newly designed pseudo‑arithmetic task. Code: https://github.com/McGuinnessChen/dual‑representation‑space‑encoding

Authors:Mansoor Ahmed, Nadeem Taj, Imdad Ullah Khan, Hemanth Venkateswara, Murray Patterson
Title: CHIMERA-Bench: A Benchmark Dataset for Epitope-Specific Antibody Design
Abstract:
Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three years, yet the field lacks a standardized benchmark for fair comparison and model development. These methods are evaluated on different SAbDab snapshots, non‑overlapping test sets, and incompatible metrics, and the literature fragments the design problem into numerous sub‑tasks with no common definition. We introduce \textscChimera‑Bench (CDR Modeling with Epitope‑guided Redesign), a unified benchmark built around a single canonical task: \emphepitope‑conditioned CDR sequence‑structure co‑design. \textscChimera‑Bench provides (1) a curated, deduplicated dataset of 2,922 antibody‑antigen complexes with epitope and paratope annotations; (2) three biologically motivated splits testing generalization to unseen epitopes, unseen antigen folds, and prospective temporal targets; and (3) a comprehensive evaluation protocol with five metric groups including novel epitope‑specificity measures. We benchmark representative methods spanning different generative paradigms and report results across all splits. \textscChimera‑Bench is the largest dataset of its kind for the antibody design problem, allowing the community to develop and test novel methods and evaluate their generalizability. The source code and data are available at: https://github.com/mansoor181/chimera‑bench.git

Authors:Jiajin Liu, Dongzhe Fan, Chuanhao Ji, Daochen Zha, Qiaoyu Tan
Title: GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
Abstract:
Vision‑Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real‑world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM‑as‑Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM‑as‑Aligner, which bridges modalities in latent or linguistic space to facilitate LLM‑based structured reasoning; and (3) VLM‑as‑Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM‑as‑Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision‑language models as a new foundation for multimodal graph learning. The benchmark code is publicly available at https://github.com/oamyjin/GraphVLM.

Authors:Jim Achterberg, Bram Van Dijk, Jing Meng, Saif Ul Islam, Gregory Epiphaniou, Carsten Maple, Xuefei Ding, Theodoros N. Arvanitis, Simon Brouwer, Marcel Haas, Marco Spruit
Title: OpenExtract: Automated Data Extraction for Systematic Reviews in Health
Abstract:
This study presents OpenExtract, an open‑source pipeline for automated data extraction in large‑scale systematic literature reviews. The pipeline queries large language models (LLMs) to predict data entries based on relevant sections of scientific articles. To test the efficacy of OpenExtract, we apply it to a systematic literature review in digital health and compare its outputs with those of human researchers. OpenExtract achieves precision and recall scores of > 0.8 in this task, indicating that it can be effective at extracting data automatically and efficiently. OpenExtract: https://github.com/JimAchterbergLUMC/OpenExtract.

Authors:Domen Preložnik, Žiga Špiclin
Title: Self-Supervised Multi-Stage Domain Unlearning for White-Matter Lesion Segmentation
Abstract:
Inter‑scanner variability of magnetic resonance imaging has an adverse impact on the diagnostic and prognostic quality of the scans and necessitates the development of models robust to domain shift inflicted by the unseen scanner data. Review of recent advances in domain adaptation showed that efficacy of strategies involving modifications or constraints on the latent space appears to be contingent upon the level and/or depth of supervision during model training. In this paper, we therefore propose an unsupervised domain adaptation technique based on self‑supervised multi‑stage unlearning (SSMSU). Building upon the state‑of‑the‑art segmentation framework nnU‑Net, we employ deep supervision at deep encoder stages using domain classifier unlearning, applied sequentially across the deep stages to suppress domain‑related latent features. Following self‑configurable approach of the nnU‑Net, the auxiliary feedback loop implements a self‑supervised backpropagation schedule for the unlearning process, since continuous unlearning was found to have a detrimental effect on the main segmentation task. Experiments were carried out on four public datasets for benchmarking white‑matter lesion segmentation methods. Five benchmark models and/or strategies, covering passive to active unsupervised domain adaptation, were tested. In comparison, the SSMSU demonstrated the advantage of unlearning by enhancing lesion sensitivity and limiting false detections, which resulted in higher overall segmentation quality in terms of segmentation overlap and relative lesion volume error. The proposed model inputs only the FLAIR modality, which simplifies preprocessing pipelines, eliminates the need for inter‑modality registration errors and harmonization, which can introduce variability. Source code is available on https://github.com/Pubec/nnunetv2‑unlearning.

Authors:Yanzhe Hu, Yijie Jin, Pengfei Liu, Kai Yu, Zhijie Deng
Title: LightningRL: Breaking the Accuracy-Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning
Abstract:
Diffusion Large Language Models (dLLMs) have emerged as a promising paradigm for parallel token generation, with block‑wise variants garnering significant research interest. Despite their potential, existing dLLMs typically suffer from a rigid accuracy‑parallelism trade‑off: increasing the number of tokens per forward (TPF) via aggressive parallel decoding often leads to performance degradation and increased generation instability. We identify that this limitation stems from the model's inability to navigate high‑parallelism regimes where approximation errors and local corruptions accumulate, ultimately undermining the reliability of parallel generation. To address this, we propose LightningRL, a post‑training framework designed to directly optimize the speed‑quality Pareto frontier of pre‑trained dLLMs. Instead of forcing uniform parallelization, our approach leverages reinforcement learning to identify and reinforce high‑parallelism trajectories that maintain generation accuracy. Built upon the Group Relative Policy Optimization (GRPO) framework, LightningRL introduces several enhancements tailored for dLLMs: (1) stabilized training via per‑reward decoupled normalization; (2) token‑level negative log‑likelihood (NLL) regularization on correct trajectories to anchor model performance; and (3) a dynamic sampling strategy with TPF‑aware filtering to enhance training efficiency. Experimental results across mathematical and coding benchmarks demonstrate that LightningRL consistently advances the Pareto frontier, achieving competitive task accuracy while significantly increasing parallelism, reaching an average TPF of 7.32 (with a peak of 11.10 on the MBPP dataset). Our code is available at https://github.com/SJTU‑DENG‑Lab/LightningRL.

Authors:Idan Sulami, Alon Itzkovitch, Michael R. Kearney, Moni Shahar, Ofir Levy
Title: Spatially Aware Deep Learning for Microclimate Prediction from High-Resolution Geospatial Imagery
Abstract:
Microclimate models are essential for linking climate to ecological processes, yet most physically based frameworks estimate temperature independently for each spatial unit and rely on simplified representations of lateral heat exchange. As a result, the spatial scales over which surrounding environmental conditions influence local microclimates remain poorly quantified. Here, we show how remote sensing can help quantify the contribution of spatial context to microclimate temperature predictions. Building on convolutional neural network principles, we designed a task‑specific deep neural network and trained a series of models in which the spatial extent of input data was systematically varied. Drone‑derived spatial layers and meteorological data were used to predict ground temperature at a focal location, allowing direct assessment of how prediction accuracy changes with increasing spatial context. Our results show that incorporating spatially adjacent information substantially improves prediction accuracy, with diminishing returns beyond spatial extents of approximately 5‑7 m. This characteristic scale indicates that ground temperatures are influenced not only by local surface properties, but also by horizontal heat transfer and radiative interactions operating across neighboring microhabitats. The magnitude of spatial effects varied systematically with time of day, microhabitat type, and local environmental characteristics, highlighting context‑dependent spatial coupling in microclimate formation. By treating deep learning as a diagnostic tool rather than solely a predictive one, our approach provides a general and transferable method for quantifying spatial dependencies in microclimate models and informing the development of hybrid mechanistic‑data‑driven approaches that explicitly account for spatial interactions while retaining physical interpretability.

Authors:Minsang Kim, Seung Jun Baek
Title: Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation
Abstract:
Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain‑of‑Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision causing a distribution mismatch, especially in complex reasoning tasks. We propose Token‑Selective Dual Knowledge Distillation (TSD‑KD), a framework for student‑centric distillation. TSD‑KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD‑KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re‑ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self‑improvement. The experiments show the state‑of‑the‑art performance of TSD‑KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner‑up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD‑KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at https://github.com/kmswin1/TSD‑KD.

Authors:Helen Qu, Rudy Morel, Michael McCabe, Alberto Bietti, François Lanusse, Shirley Ho, Yann LeCun
Title: Representation Learning for Spatiotemporal Physical Systems
Abstract:
Machine learning approaches to spatiotemporal physical systems have primarily focused on next‑frame prediction, with the goal of learning an accurate emulator for the system's evolution in time. However, these emulators are computationally expensive to train and are subject to performance pitfalls, such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of predicting the next frame, such as estimation of a system's governing physical parameters. Accuracy on these tasks offers a uniquely quantifiable glimpse into the physical relevance of the representations of these models. We evaluate the effectiveness of general‑purpose self‑supervised methods in learning physics‑grounded representations that are useful for downstream scientific tasks. Surprisingly, we find that not all methods designed for physical modeling outperform generic self‑supervised learning methods on these tasks, and methods that learn in the latent space (e.g., joint embedding predictive architectures, or JEPAs) outperform those optimizing pixel‑level prediction objectives. Code is available at https://github.com/helenqu/physical‑representation‑learning.

Authors:Yiqi Zhou, Yue Yuan, Yikai Wang, Bohao Liu, Qinxin Mei, Zhuohua Liu, Shan Shen, Wei Xing, Daying Sun, Li Li, Guozhu Liu
Title: OpenACMv2: An Accuracy-Constrained Co-Optimization Framework for Approximate DCiM
Abstract:
Digital Compute‑in‑Memory (DCiM) accelerates neural networks by reducing data movement. Approximate DCiM can further improve power‑performance‑area (PPA), but demands accuracy‑constrained co‑optimization across coupled architecture and transistor‑level choices. Building on OpenYield, we introduce Accuracy‑Constrained Co‑Optimization (ACCO) and present OpenACMv2, an open framework that operationalizes ACCO via two‑level optimization: (1) accuracy‑constrained architecture search of compressor combinations and SRAM macro parameters, driven by a fast GNN‑based surrogate for PPA and error; and (2) variation‑ and PVT‑aware transistor sizing for standard cells and SRAM bitcells using Monte Carlo. By decoupling ACCO into architecture‑level exploration and circuit‑level sizing, OpenACMv2 integrates classic single‑ and multi‑objective optimizers to deliver strong PPA‑accuracy tradeoffs and robust convergence. The workflow is compatible with FreePDK45 and OpenROAD, supporting reproducible evaluation and easy adoption. Experiments demonstrate significant PPA improvements under controlled accuracy budgets, enabling rapid "what‑if" exploration for approximate DCiM. The framework is available on https://github.com/ShenShan123/OpenACM.

Authors:Steven Motta, Gioele Nanni
Title: Federated Few-Shot Learning on Neuromorphic Hardware: An Empirical Study Across Physical Edge Nodes
Abstract:
Federated learning on neuromorphic hardware remains unexplored because on‑chip spike‑timing‑dependent plasticity (STDP) produces binary weight updates rather than the floating‑point gradients assumed by standard algorithms. We build a two‑node federated system with BrainChip Akida AKD1000 processors and run approximately 1,580 experimental trials across seven analysis phases. Of four weight‑exchange strategies tested, neuron‑level concatenation (FedUnion) consistently preserves accuracy while element‑wise weight averaging (FedAvg) destroys it (p = 0.002). Domain‑adaptive fine‑tuning of the upstream feature extractor accounts for most of the accuracy gains, confirming feature quality as the dominant factor. Scaling feature dimensionality from 64 to 256 yields 77.0% best‑strategy federated accuracy (n=30, p < 0.001). Two independent asymmetries (wider features help federation more than individual learning, while binarization hurts federation more) point to a shared prototype complementarity mechanism: cross‑node transfer scales with the distinctiveness of neuron prototypes.

Authors:Tianhao Fu, Bingxuan Yang, Juncheng Guo, Shrena Sribalan, Yucheng Chen
Title: SortScrews: A Dataset and Baseline for Real-time Screw Classification
Abstract:
Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particularly for controlled single‑object scenarios commonly encountered in automated sorting systems. In this work, we introduce SortScrews, a dataset for casewise visual classification of screws. The dataset contains 560 RGB images at 512×512 resolution covering six screw types and a background class. Images are captured using a standardized acquisition setup and include mild variations in lighting and camera perspective across four capture settings. To facilitate reproducible research and dataset expansion, we also provide a reusable data collection script that allows users to easily construct similar datasets for custom hardware components using inexpensive camera setups. We establish baseline results using transfer learning with EfficientNet‑B0 and ResNet‑18 classifiers pretrained on ImageNet. In addition, we conduct a well‑explored failure analysis. Despite the limited dataset size, these lightweight models achieve strong classification accuracy, demonstrating that controlled acquisition conditions enable effective learning even with relatively small datasets. The dataset, collection pipeline, and baseline training code are publicly available at https://github.com/ATATC/SortScrews.

Authors:Chenlong Yin, Runpeng Geng, Yanting Wang, Jinyuan Jia
Title: PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
Abstract:
Prompt injection poses serious security risks to real‑world LLM applications, particularly autonomous agents. Although many defenses have been proposed, their robustness against adaptive attacks remains insufficiently evaluated, potentially creating a false sense of security. In this work, we propose PISmith, a reinforcement learning (RL)‑based red‑teaming framework that systematically assesses existing prompt‑injection defenses by training an attack LLM to optimize injected prompts in a practical black‑box setting, where the attacker can only query the defended LLM and observe its outputs. We find that directly applying standard GRPO to attack strong defenses leads to sub‑optimal performance due to extreme reward sparsity ‑‑ most generated injected prompts are blocked by the defense, causing the policy's entropy to collapse before discovering effective attack strategies, while the rare successes cannot be learned effectively. In response, we introduce adaptive entropy regularization and dynamic advantage weighting to sustain exploration and amplify learning from scarce successes. Extensive evaluation on 13 benchmarks demonstrates that state‑of‑the‑art prompt injection defenses remain vulnerable to adaptive attacks. We also compare PISmith with 7 baselines across static, search‑based, and RL‑based attack categories, showing that PISmith consistently achieves the highest attack success rates. Furthermore, PISmith achieves strong performance in agentic settings on InjecAgent and AgentDojo against both open‑source and closed‑source LLMs (e.g., GPT‑4o‑mini and GPT‑5‑nano). Our code is available at https://github.com/albert‑y1n/PISmith.

Authors:Shaofeng Guo, Jiequan Cui, Richang Hong
Title: Rethinking VLMs for Image Forgery Detection and Localization
Abstract:
With the rapid rise of Artificial Intelligence Generated Content (AIGC), image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). In this paper, we study how to fully leverage vision‑language models (VLMs) to assist the IFDL task. In particular, we observe that priors from VLMs hardly benefit the detection and localization performance and even have negative effects due to their inherent biases toward semantic plausibility rather than authenticity. Additionally, the location masks explicitly encode the forgery concepts, which can serve as extra priors for VLMs to ease their training optimization, thus enhancing the interpretability of detection and localization results. Building on these findings, we propose a new IFDL pipeline named IFDL‑VLM. To demonstrate the effectiveness of our method, we conduct experiments on 9 popular benchmarks and assess the model performance under both in‑domain and cross‑dataset generalization settings. The experimental results show that we consistently achieve new state‑of‑the‑art performance in detection, localization, and interpretability.Code is available at: https://github.com/sha0fengGuo/IFDL‑VLM.

Authors:Kadir-Kaan Özer, René Ebeling, Markus Enzweiler
Title: Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection
Abstract:
Multivariate time series anomalies often manifest as shifts in cross‑channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration. Residual‑based detectors can miss such anomalies when flexible sequence models still reconstruct signals plausibly despite altered coordination. We introduce AxonAD, an unsupervised detector that treats multi‑head attention query evolution as a short horizon predictable process. A gradient‑updated reconstruction pathway is coupled with a history‑only predictor that forecasts future query vectors from past context. This is trained via a masked predictor‑target objective against an exponential moving average (EMA) target encoder. At inference, reconstruction error is combined with a tail‑aggregated query mismatch score, which measures cosine deviation between predicted and target queries on recent timesteps. This dual approach provides sensitivity to structural dependency shifts while retaining amplitude‑level detection. On proprietary in‑vehicle telemetry with interval annotations and on the TSB‑AD multi‑variate suite (17 datasets, 180 series) with threshold‑free and range‑aware metrics, AxonAD improves ranking quality and temporal localization over strong baselines. Ablations confirm that query prediction and combined scoring are the primary drivers of the observed gains. Code is available at the URL https://github.com/iis‑esslingen/AxonAD.

Authors:David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, Samuli Laine
Title: Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models
Abstract:
Reinforcement learning (RL) has become a standard technique for post‑training diffusion‑based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high‑quality vision language models and off‑the‑shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.

Authors:Yiqun Zhang, Zexi Tan, Xiaopeng Luo, Yunlin Liu
Title: Hierarchical Reference Sets for Robust Unsupervised Detection of Scattered and Clustered Outliers
Abstract:
Most real‑world IoT data analysis tasks, such as clustering and anomaly event detection, are unsupervised and highly susceptible to the presence of outliers. In addition to sporadic scattered outliers caused by factors such as faulty sensor readings, IoT systems often exhibit clustered outliers. These occur when multiple devices or nodes produce similar anomalous measurements, for instance, owing to localized interference, emerging security threats, or regional false alarms, forming micro‑clusters. These clustered outliers can be easily mistaken for normal behavior because of their relatively high local density, thereby obscuring the detection of both scattered and contextual anomalies. To address this, we propose a novel outlier detection paradigm that leverages the natural neighboring relationships using graph structures. This facilitates multi‑perspective anomaly evaluation by incorporating reference sets at both local and global scales derived from the graph. Our approach enables the effective recognition of scattered outliers without interference from clustered anomalies, whereas the graph structure simultaneously helps reflect and isolate clustered outlier groups. Extensive experiments, including comparative performance analysis, ablation studies, validation on downstream clustering tasks, and evaluation of hyperparameter sensitivity, demonstrate the efficacy of the proposed method. The source code is available at https://github.com/gordonlok/DROD.

Authors:Yonghun Jeong, David Yoon Suk Kang, Yeon-Chang Lee
Title: Anchored Alignment: Preventing Positional Collapse in Multimodal Recommender Systems
Abstract:
Multimodal recommender systems (MMRS) leverage images, text, and interaction signals to enrich item representations. However, recent alignment based MMRSs that enforce a unified embedding space often blur modality specific structures and exacerbate ID dominance. Therefore, we propose AnchorRec, a multimodal recommendation framework that performs indirect, anchor based alignment in a lightweight projection domain. By decoupling alignment from representation learning, AnchorRec preserves each modality's native structure while maintaining cross modal consistency and avoiding positional collapse. Experiments on four Amazon datasets show that AnchorRec achieves competitive top N recommendation accuracy, while qualitative analyses demonstrate improved multimodal expressiveness and coherence. The codebase of AnchorRec is available at https://github.com/hun9008/AnchorRec.

Authors:Jillur Rahman Saurav, Thuong Le Hoai Pham, Pritam Mukherjee, Paul Yi, Brent A. Orr, Jacob M. Luber
Title: UNIStainNet: Foundation-Model-Guided Virtual Staining of H&E to IHC
Abstract:
Virtual immunohistochemistry (IHC) staining from hematoxylin and eosin (H&E) images can accelerate diagnostics by providing preliminary molecular insight directly from routine sections, reducing the need for repeat sectioning when tissue is limited. Existing methods improve realism through contrastive objectives, prototype matching, or domain alignment, yet the generator itself receives no direct guidance from pathology foundation models. We present UNIStainNet, a SPADE‑UNet conditioned on dense spatial tokens from a frozen pathology foundation model (UNI), providing tissue‑level semantic guidance for stain translation. A misalignment‑aware loss suite preserves stain quantification accuracy, and learned stain embeddings enable a single model to serve multiple IHC markers simultaneously. On MIST, UNIStainNet achieves state‑of‑the‑art distributional metrics on all four stains (HER2, Ki67, ER, PR) from a single unified model, where prior methods typically train separate per‑stain models. On BCI, it also achieves the best distributional metrics. A tissue‑type stratified failure analysis reveals that remaining errors are systematic, concentrating in non‑tumor tissue. Code is available at https://github.com/facevoid/UNIStainNet.

Authors:Selim Furkan Tekin, Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Margaret L. Loper, Ling Liu
Title: Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning
Abstract:
With the growing number and diversity of Vision‑Language Models (VLMs), many works explore language‑based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi‑model reasoning. In contrast, we address the diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA‑based focal diversity metric (CKA‑focal) to measure disagreement in their visual embeddings. On the constructed ensemble surface from a pool of candidate VLMs, we applied a Genetic Algorithm to effectively prune out those component VLMs that do not add value to the fusion performance. We identify the best combination for each task as well as fuse the outputs of each VLMs in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal‑diversity fused predictions with high performance for vision‑language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A‑OKVQA, MMMU, MMMU‑Pro, and OCR‑VQA). The results show that V3Fusion outperforms the best‑performing VLM on MMMU by 8.09% and MMMU‑Pro by 4.87% gain in accuracy. For generative tasks, V3Fusion outperforms Intern‑VL2‑8b and Qwen2.5‑VL‑7b, the top‑2 VLM performers on both A‑OKVQA and OCR‑VQA. Our code and datasets are available at https://github.com/sftekin/v3fusion.

Authors:Gihoon Kim, Euntai Kim
Title: Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback
Abstract:
Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large‑scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user‑specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single‑reward model. To overcome this limitation, we propose Swap‑guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap‑guided base regularization, (2) Preferential Inverse Autoregressive Flow (P‑IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user‑specific latents, and improves preference prediction. Our code and data are available at https://github.com/cobang0111/SPL

Authors:Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland
Title: Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Abstract:
Reinforcement learning (RL) has been effective for post‑training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence‑level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion‑based sequence generation as a finite‑horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute‑efficient estimator, we (i) select denoising steps for policy updates via an entropy‑guided approximation bound, and (ii) estimate intermediate advantages using a one‑step denoising reward naturally provided by the diffusion model, avoiding costly multi‑step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state‑of‑the‑art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post‑training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo‑dllm‑rl.

Authors:Shivam Chaudhary, Sheethal Bhat, Andreas Maier
Title: Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding
Abstract:
Accurate detection and localization of traumatic injuries in abdominal CT scans remains a critical challenge in emergency radiology, primarily due to severe scarcity of annotated medical data. This paper presents a label‑efficient approach combining self‑supervised pre‑training with semi‑supervised detection for 3D medical image analysis. We employ patch‑based Masked Image Modeling (MIM) to pre‑train a 3D U‑Net encoder on 1,206 CT volumes without annotations, learning robust anatomical representations. The pretrained encoder enables two downstream clinical tasks: 3D injury detection using VDETR with Vertex Relative Position Encoding, and multi‑label injury classification. For detection, semi‑supervised learning with 2,000 unlabeled volumes and consistency regularization achieves 56.57% validation mAP@0.50 and 45.30% test mAP@0.50 with only 144 labeled training samples, representing a 115% improvement over supervised‑only training. For classification, expanding to 2,244 labeled samples yields 94.07% test accuracy across seven injury categories using only a frozen encoder, demonstrating immediately transferable self‑supervised features. Our results validate that self‑supervised pre‑training combined with semi‑supervised learning effectively addresses label scarcity in medical imaging, enabling robust 3D object detection with limited annotations.

Authors:Joong Ho Kim, Nicholas Thai, Souhardya Saha Dip, Dong Lao, Keith G. Mills
Title: Naïve PAINE: Lightweight Text-to-Image Generation Improvement with Prompt Evaluation
Abstract:
Text‑to‑Image (T2I) generation is primarily driven by Diffusion Models (DM) which rely on random Gaussian noise. Thus, like playing the slots at a casino, a DM will produce different results given the same user‑defined inputs. This imposes a gambler's burden: To perform multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the distribution of generated content quality highly depends on the prompt and the generative ability of a DM with respect to it. To account for this, we propose Naïve PAINE for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and given prompt. Naïve PAINE then selects a handful of quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM generative quality given the prompt and is lightweight enough to seamlessly fit into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt corpus benchmarks.

Authors:Rujie Wu, Haozhe Zhao, Hai Ci, Yizhou Wang
Title: Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning
Abstract:
Multimodal instruction tuning is often compute‑inefficient because training budgets are spread across large mixed image‑video pools whose utility is highly uneven. We present Goal‑Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1× training subsets for different goals. Under a fixed one‑epoch Qwen3‑VL‑8B‑Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni‑10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k‑sample Uni‑10x baseline, GDO reaches the Uni‑10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving Accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra‑long‑video setting and the mismatch between that benchmark and the short‑video/image‑dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long‑video understanding behavior. Overall, GDO provides a goal‑driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at https://github.com/rujiewu/GDO.

Authors:Davi Bonetto
Title: SpectralGuard: Detecting Memory Collapse Attacks in State Space Models
Abstract:
State Space Models (SSMs) such as Mamba achieve linear‑time sequence processing through input‑dependent recurrence, but this mechanism introduces a critical safety vulnerability. We show that the spectral radius rho(A‑bar) of the discretized transition operator governs effective memory horizon: when an adversary drives rho toward zero through gradient‑based Hidden State Poisoning, memory collapses from millions of tokens to mere dozens, silently destroying reasoning capacity without triggering output‑level alarms. We prove an Evasion Existence Theorem showing that for any output‑only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection, then introduce SpectralGuard, a real‑time monitor that tracks spectral stability across all model layers. SpectralGuard achieves F1=0.961 against non‑adaptive attackers and retains F1=0.842 under the strongest adaptive setting, with sub‑15ms per‑token latency. Causal interventions and cross‑architecture transfer to hybrid SSM‑Attention systems confirm that spectral monitoring provides a principled, deployable safety layer for recurrent foundation models.

Authors:Ping He, Om Khangaonkar, Hamed Pirsiavash, Yikun Bai, Soheil Kolouri
Title: Sinkhorn-Drifting Generative Models
Abstract:
We establish a theoretical link between the recently proposed "drifting" generative dynamics and gradient flows induced by the Sinkhorn divergence. In a particle discretization, the drift field admits a cross‑minus‑self decomposition: an attractive term toward the target distribution and a repulsive/self‑correction term toward the current model, both expressed via one‑sided normalized Gibbs kernels. We show that Sinkhorn divergence yields an analogous cross‑minus‑self structure, but with each term defined by entropic optimal‑transport couplings obtained through two‑sided Sinkhorn scaling (i.e., enforcing both marginals). This provides a precise sense in which drifting acts as a surrogate for a Sinkhorn‑divergence gradient flow, interpolating between one‑sided normalization and full two‑sided Sinkhorn scaling. Crucially, this connection resolves an identifiability gap in prior drifting formulations: leveraging the definiteness of the Sinkhorn divergence, we show that zero drift (equilibrium of the dynamics) implies that the model and target measures match. Experiments show that Sinkhorn drifting reduces sensitivity to kernel temperature and improves one‑step generative quality, trading off additional training time for a more stable optimization, without altering the inference procedure used by drift methods. These theoretical gains translate to strong low‑temperature improvements in practice: on FFHQ‑ALAE at the lowest temperature setting we evaluate, Sinkhorn drifting reduces mean FID from 187.7 to 37.1 and mean latent EMD from 453.3 to 144.4, while on MNIST it preserves full class coverage across the temperature sweep. Project page: https://mint‑vu.github.io/SinkhornDrifting/

Authors:Ziwei Wang, Zhentao He, Xingyi He, Hongbin Wang, Tianwang Jia, Jingwei Luo, Siyang Li, Xiaoqing Chen, Dongrui Wu
Title: Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions
Abstract:
Deep learning has achieved transformative performance across diverse domains, largely driven by the large‑scale, high‑quality training data. In contrast, the development of brain‑computer interfaces (BCIs) is fundamentally constrained by the limited, heterogeneous, and privacy‑sensitive neural recordings. Generating synthetic yet physiologically plausible brain signals has therefore emerged as a compelling way to mitigate data scarcity and enhance model capacity. This survey provides a comprehensive review of brain signal generation for BCIs, covering methodological taxonomies, benchmark experiments, evaluation metrics, and key applications. We systematically categorize existing generative algorithms into four types: knowledge‑based, feature‑based, model‑based, and translation‑based approaches. Furthermore, we benchmark existing brain signal generation approaches across four representative BCI paradigms to provide an objective performance comparison. Finally, we discuss the potentials and challenges of current generation approaches and prospect future research on accurate, data‑efficient, and privacy‑aware BCI systems. The benchmark codebase is publicized at https://github.com/wzwvv/DG4BCI.

Authors:Yining Qian, Lijie Su, Meiling Xu, Xianpeng Wang
Title: Multi-objective Genetic Programming with Multi-view Multi-level Feature for Enhanced Protein Secondary Structure Prediction
Abstract:
Predicting protein secondary structure is essential for understanding protein function and advancing drug discovery. However, the intricate sequence‑structure relationship poses significant challenges for accurate modeling. To address these, we propose MOGP‑MMF, a multi‑objective genetic programming framework that reformulates PSSP as an automated optimization task focused on feature selection and fusion. Specifically, MOGP‑MMF introduces a multi‑view multi‑level representation strategy that integrates evolutionary, semantic, and newly introduced structural views to capture the comprehensive protein folding logic. Leveraging an enriched operator set, the framework evolves both linear and nonlinear fusion functions, effectively capturing high‑order feature interactions while reducing fusion complexity. To resolve the accuracy‑complexity trade‑off, an improved multi‑objective GP algorithm is developed, incorporating a knowledge transfer mechanism that utilizes prior evolutionary experience to guide the population toward global optima. Extensive experiments across seven benchmark datasets demonstrate that MOGP‑MMF surpasses state‑of‑the‑art methods, particularly in Q8 accuracy and structural integrity. Furthermore, MOGP‑MMF generates a diverse set of non‑dominated solutions, offering flexible model selection schemes for various practical application scenarios. The source code is available on GitHub: https://github.com/qian‑ann/MOGP‑MMF/tree/main.

Authors:Terrence J. Lee-St. John, Jordan L. Lawson, Bartlomiej Piechowski-Jozwiak
Title: From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness
Abstract:
Tabular machine learning presents a paradox: modern models achieve state‑of‑the‑art performance using high‑dimensional (high‑D), collinear, error‑prone data, defying the "Garbage In, Garbage Out" mantra. To help resolve this, we synthesize principles from Information Theory, Latent Factor Models, and Psychometrics, clarifying that predictive robustness arises not solely from data cleanliness, but from the synergy between data architecture and model capacity. Partitioning predictor‑space "noise" into "Predictor Error" and "Structural Uncertainty" (informational deficits from stochastic generative mappings), we prove that leveraging high‑D sets of error‑prone predictors asymptotically overcomes both types of noise, whereas cleaning a low‑D set is fundamentally bounded by Structural Uncertainty. We demonstrate why "Informative Collinearity" (dependencies from shared latent causes) enhances reliability and convergence efficiency, and explain why increased dimensionality reduces the latent inference burden, enabling feasibility with finite samples. To address practical constraints, we propose "Proactive Data‑Centric AI" to identify predictors that enable robustness efficiently. We also derive boundaries for Systematic Error Regimes and show why models that absorb "rogue" dependencies can mitigate assumption violations. Linking latent architecture to Benign Overfitting, we offer a first step towards a unified view of robustness to Outcome Error and predictor‑space noise, while also delineating when traditional DCAI's focus on label cleaning remains powerful. By redefining data quality from item‑level perfection to portfolio‑level architecture, we provide a theoretical rationale for "Local Factories" ‑‑ learning from live, uncurated enterprise "data swamps" ‑‑ supporting a deployment paradigm shift from "Model Transfer" to "Methodology Transfer'' to overcome static generalizability limitations.

Authors:Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata
Title: The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
Abstract:
Text‑to‑image generation models have advanced rapidly, yet achieving fine‑grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training‑free method in FLUX based solely on closed‑form latent‑space manipulation. Code is available at https://github.com/ExplainableML/LCS.

Authors:Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan
Title: Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
Abstract:
Humans perceive and understand real‑world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial‑TTT towards streaming visual‑based spatial intelligence with test‑time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long‑horizon scene videos. Specifically, we design a hybrid architecture and adopt large‑chunk updates parallel with sliding‑window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial‑predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial‑TTT improves long‑horizon spatial understanding and achieves state‑of‑the‑art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial‑TTT.

Authors:Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen
Title: BiGain: Unified Token Compression for Joint Generation and Classification
Abstract:
Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training‑free, plug‑and‑play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature‑space signals into a frequency‑aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency‑aware operators: (1) Laplacian‑gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high‑contrast tokens, thereby retaining edges and textures; and (2) Interpolate‑Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT‑ and U‑Net‑based backbones and ImageNet‑1K, ImageNet‑100, Oxford‑IIIT Pets, and COCO‑2017, our operators consistently improve the speed‑accuracy trade‑off for diffusion‑based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet‑1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high‑frequency detail and low/mid‑frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower‑cost deployment.

Authors:Yulu Gan, Phillip Isola
Title: Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
Abstract:
Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task‑specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well‑pretrained models the density of task‑experts increases dramatically, so that diverse, task‑improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post‑training method that samples N parameter perturbations at random, selects the top K, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post‑training methods such as PPO, GRPO, and ES for contemporary large‑scale models.

Authors:Jae-Won Chung, Jeff J. Ma, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury
Title: Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
Abstract:
Any‑to‑Any models are an emerging class of multimodal models that accept combinations of multimodal data (e.g., text, image, video, audio) as input and generate them as output. Serving these models are challenging; different requests with different input and output modalities traverse different paths through the model computation graph, and each component of the model have different scaling characteristics. We present Cornserve, a distributed serving system for generic Any‑to‑Any models. Cornserve provides a flexible task abstraction for expressing Any‑to‑Any model computation graphs, enabling component disaggregation and independent scaling. The distributed runtime dispatches compute to the data plane via an efficient record‑and‑replay execution model that keeps track of data dependencies, and forwards tensor data between components directly from the producer to the consumer. Built on Kubernetes with approximately 23K new lines of Python, Cornserve supports diverse Any‑to‑Any models and delivers up to 3.81× higher throughput and 5.79× lower tail latency. Cornserve is open‑source, and the demo video is available on YouTube.

Authors:Ming-Hong Chen, Kuan-Chen Pan, You-De Huang, Xi Liu, Ping-Chun Hsieh
Title: Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
Abstract:
Cross‑domain reinforcement learning (CDRL) is meant to improve the data efficiency of RL by leveraging the data samples collected from a source domain to facilitate the learning in a similar target domain. Despite its potential, cross‑domain transfer in RL is known to have two fundamental and intertwined challenges: (i) The source and target domains can have distinct state space or action space, and this makes direct transfer infeasible and thereby requires more sophisticated inter‑domain mappings; (ii) The transferability of a source‑domain model in RL is not easily identifiable a priori, and hence CDRL can be prone to negative effect during transfer. In this paper, we propose to jointly tackle these two challenges through the lens of cross‑domain Bellman consistency and hybrid critic. Specifically, we first introduce the notion of cross‑domain Bellman consistency as a way to measure transferability of a source‑domain model. Then, we propose QAvatar, which combines the Q functions from both the source and target domains with an adaptive hyperparameter‑free weight function. Through this design, we characterize the convergence behavior of QAvatar and show that QAvatar achieves reliable transfer in the sense that it effectively leverages a source‑domain Q function for knowledge transfer to the target domain. Through experiments, we demonstrate that QAvatar achieves favorable transferability across various RL benchmark tasks, including locomotion and robot arm manipulation. Our code is available at https://rl‑bandits‑lab.github.io/Cross‑Domain‑RL/.

Authors:Ping Guo, Tiantian Zhang, Xi Lin, Xiang Li, Zhi-Ri Tang, Qingfu Zhang
Title: Few-for-Many Personalized Federated Learning
Abstract:
Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data privacy. Existing approaches often rely on heuristics like clustering or model interpolation, which lack principled mechanisms for balancing heterogeneous client objectives. Serving M clients with distinct data distributions is inherently a multi‑objective optimization problem, where achieving optimal personalization ideally requires M distinct models on the Pareto front. However, maintaining M separate models poses significant scalability challenges in federated settings with hundreds or thousands of clients. To address this challenge, we reformulate PFL as a few‑for‑many optimization problem that maintains only K shared server models (K \ll M) to collectively serve all M clients. We prove that this framework achieves near‑optimal personalization: the approximation error diminishes as K increases and each client's model converges to each client's optimum as data grows. Building on this reformulation, we propose FedFew, a practical algorithm that jointly optimizes the K server models through efficient gradient‑based updates. Unlike clustering‑based approaches that require manual client partitioning or interpolation‑based methods that demand careful hyperparameter tuning, FedFew automatically discovers the optimal model diversity through its optimization process. Experiments across vision, NLP, and real‑world medical imaging datasets demonstrate that FedFew, with just 3 models, consistently outperforms other state‑of‑the‑art approaches. Code is available at https://github.com/pgg3/FedFew.

Authors:Ilias Aarab
Title: BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
Abstract:
Zero‑shot text classification (ZSC) offers the promise of eliminating costly task‑specific annotation by matching texts directly to human‑readable label descriptions. While early approaches have predominantly relied on cross‑encoder models fine‑tuned for natural language inference (NLI), recent advances in text‑embedding models, rerankers, and instruction‑tuned large language models (LLMs) have challenged the dominance of NLI‑based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine‑tuning, leaving genuine zero‑shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross‑encoders, embedding models, rerankers and instruction‑tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3‑Reranker‑8B, set a new state‑of‑the‑art with macro F1 = 0.72; (ii) strong embedding models such as GTE‑large‑en‑v1.5 substantially close the accuracy gap while offering the best trade‑off between accuracy and latency; (iii) instruction‑tuned LLMs at 4‑‑12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross‑encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero‑shot text understanding.

Authors:Xingze Zou, Jing Wang, Yuhua Zheng, Xueyi Chen, Haolei Bai, Lingcheng Kong, Syed A. R. Abu-Bakar, Zhaode Wang, Chengfei Lv, Haoji Hu, Huan Wang
Title: MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross‑framework interoperability, coupled with an automated pipeline that bridges the host‑device gap for on‑device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine‑tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain‑specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi‑agent system equipped with repository‑aware reasoning and a plan‑and‑execute paradigm. Validated on MobileKernelBench, MoKA achieves state‑of‑the‑art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.

Authors:Fengyuan Yu, Xiaohua Feng, Yuyuan Li, Changwang Zhang, Jun Wang, Chaochao Chen
Title: Sharpness-Aware Minimization for Generalized Embedding Learning in Federated Recommendation
Abstract:
Federated recommender systems enable collaborative model training while keeping user interaction data local and sharing only essential model parameters, thereby mitigating privacy risks. However, existing methods overlook a critical issue, i.e., the stable learning of a generalized item embedding throughout the federated recommender system training process. Item embedding plays a central role in facilitating knowledge sharing across clients. Yet, under the cross‑device setting, local data distributions exhibit significant heterogeneity and sparsity, exacerbating the difficulty of learning generalized embeddings. These factors make the stable learning of generalized item embeddings both indispensable for effective federated recommendation and inherently difficult to achieve. To fill this gap, we propose a new federated recommendation framework, named Federated Recommendation with Generalized Embedding Learning (FedRecGEL). We reformulate the federated recommendation problem from an item‑centered perspective and cast it as a multi‑task learning problem, aiming to learn generalized embeddings throughout the training procedure. Based on theoretical analysis, we employ sharpness‑aware minimization to address the generalization problem, thereby stabilizing the training process and enhancing recommendation performance. Extensive experiments on four datasets demonstrate the effectiveness of FedRecGEL in significantly improving federated recommendation performance. Our code is available at https://github.com/anonymifish/FedRecGEL.

Authors:Yuxiang Liu, Qiao Liu, Tong Luo, Yanglei Gan, Peng He, Yao LIu
Title: Bridging Discrete Marks and Continuous Dynamics: Dual-Path Cross-Interaction for Marked Temporal Point Processes
Abstract:
Predicting irregularly spaced event sequences with discrete marks poses significant challenges due to the complex, asynchronous dependencies embedded within continuous‑time data streams.Existing sequential approaches capture dependencies among event tokens but ignore the continuous evolution between events, while Neural Ordinary Differential Equation (Neural ODE) methods model smooth dynamics yet fail to account for how event types influence future timing.To overcome these limitations, we propose NEXTPP, a dual‑channel framework that unifies discrete and continuous representations via Event‑granular Neural Evolution with Cross‑Interaction for Marked Temporal Point Processes. Specifically, NEXTPP encodes discrete event marks via a self‑attention mechanism, simultaneously evolving a latent continuous‑time state using a Neural ODE. These parallel streams are then fused through a crossattention module to enable explicit bidirectional interaction between continuous and discrete representations. The fused representations drive the conditional intensity function of the neural Hawkes process, while an iterative thinning sampler is employed to generate future events. Extensive evaluations on five real‑world datasets demonstrate that NEXTPP consistently outperforms state‑of‑the‑art models. The source code can be found at https://github.com/AONE‑NLP/NEXTPP.

Authors:Ehsan Hoseinzade, Ke Wang
Title: ZTab: Domain-based Zero-shot Annotation for Table Columns
Abstract:
This study addresses the challenge of automatically detecting semantic column types in relational tables, a key task in many real‑world applications. Zero‑shot modeling eliminates the need for user‑provided labeled training data, making it ideal for scenarios where data collection is costly or restricted due to privacy concerns. However, existing zero‑shot models suffer from poor performance when the number of semantic column types is large, limited understanding of tabular structure, and privacy risks arising from dependence on high‑performance closed‑source LLMs. We introduce ZTab, a domain‑based zero‑shot framework that addresses both performance and zero‑shot requirements. Given a domain configuration consisting of a set of predefined semantic types and sample table schemas, ZTab generates pseudo‑tables for the sample schemas and fine‑tunes an annotation LLM on them. ZTab is domain‑based zero‑shot in that it does not depend on user‑specific labeled training data; therefore, no retraining is needed for a test table from a similar domain. We describe three cases of domain‑based zero‑shot. The domain configuration of ZTab provides a trade‑off between the extent of zero‑shot and annotation performance: a "universal domain" that contains all semantic types approaches "pure" zero‑shot, while a "specialized domain" that contains semantic types for a specific application enables better zero‑shot performance within that domain. Source code and datasets are available at https://github.com/hoseinzadeehsan/ZTab

Authors:Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
Title: Meta-Reinforcement Learning with Self-Reflection for Agentic Search
Abstract:
This paper introduces MR‑Search, an in‑context meta reinforcement learning (RL) formulation for agentic search with self‑reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR‑Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR‑Search learns to learn a search strategy with self‑reflection, allowing search agents to improve in‑context exploration at test‑time. Specifically, MR‑Search performs cross‑episode exploration by generating explicit self‑reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration during test‑time. We further introduce a multi‑turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine‑grained credit assignment on each episode. Empirical results across various benchmarks demonstrate the advantages of MR‑Search over baselines based RL, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR‑Search.

Authors:Massimiliano Altieri, Ronan Hamon, Roberto Corizzo, Michelangelo Ceci, Ignacio Sanchez
Title: DNS-GT: A Graph-based Transformer Approach to Learn Embeddings of Domain Names from DNS Queries
Abstract:
Network intrusion detection systems play a crucial role in the security strategy employed by organisations to detect and prevent cyberattacks. Such systems usually combine pattern detection signatures with anomaly detection techniques powered by machine learning methods. However, the commonly proposed machine learning methods present drawbacks such as over‑reliance on labeled data and limited generalization capabilities. To address these issues, embedding‑based methods have been introduced to learn representations from network data, such as DNS traffic, mainly due to its large availability, that generalise effectively to many downstream tasks. However, current approaches do not properly consider contextual information among DNS queries. In this paper, we tackle this issue by proposing DNS‑GT, a novel Transformer‑based model that learns embeddings for domain names from sequences of DNS queries. The model is first pre‑trained in a self‑supervised fashion in order to learn the general behavior of DNS activity. Then, it can be finetuned on specific downstream tasks, exploiting interactions with other relevant queries in a given sequence. Our experiments with real‑world DNS data showcase the ability of our method to learn effective domain name representations. A quantitative evaluation on domain name classification and botnet detection tasks shows that our approach achieves better results compared to relevant baselines, creating opportunities for further exploration of large‑scale language models for intrusion detection systems. Our code is available at: https://github.com/m‑altieri/DNS‑GT.

Authors:Zeyuan Guo, Enmao Diao, Cheng Yang, Chuan Shi
Title: Graph Tokenization for Bridging Graphs and Transformers
Abstract:
The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph‑structured data remains a significant challenge. In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without architectural modifications. The proposed approach achieves state‑of‑the‑art results on 14 benchmark datasets and frequently outperforms both graph neural networks and specialized graph transformers. This work bridges the gap between graph‑structured data and the ecosystem of sequence models. Our code is available at \hrefhttps://github.com/BUPT‑GAMMA/Graph‑Tokenization‑for‑Bridging‑Graphs‑and‑Transformers\colorbluehere.

Authors:Chandler Smith, Magnus Sesodia, Friedrich Lindenberg, Christian Schroeder de Witt
Title: OpenSanctions Pairs: Large-Scale Entity Matching with LLMs
Abstract:
We release OpenSanctions Pairs, a large‑scale entity matching benchmark derived from real‑world international sanctions aggregation and analyst deduplication. The dataset contains 755,540 labeled pairs spanning 293 heterogeneous sources across 31 countries, with multilingual and cross‑script names, noisy and missing attributes, and set‑valued fields typical of compliance workflows. We benchmark a production rule‑based matcher (nomenklatura RegressionV1 algorithm) against open‑ and closed‑source LLMs in zero‑ and few‑shot settings. Off‑the‑shelf LLMs substantially outperform the production rule‑based baseline (91.33% F1), reaching up to 98.95% F1 (GPT‑4o) and 98.23% F1 with a locally deployable open model (DeepSeek‑R1‑Distill‑Qwen‑14B). DSPy MIPROv2 prompt optimization yields consistent but modest gains, while adding in‑context examples provides little additional benefit and can degrade performance. Error analysis shows complementary failure modes: the rule‑based system over‑matches (high false positives), whereas LLMs primarily fail on cross‑script transliteration and minor identifier/date inconsistencies. These results indicate that pairwise matching performance is approaching a practical ceiling in this setting, and motivate shifting effort toward pipeline components such as blocking, clustering, and uncertainty‑aware review. Code available at https://github.com/chansmi/OSINT_entity_resolution

Authors:Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette
Title: Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation
Abstract:
We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel‑wise 1D approximations that neglect lateral diffusion, and soft‑constrained Physics‑Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high‑resolution 3D tomography. Our discretize‑then‑optimize paradigm effectively mitigates the spectral bias and ill‑posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at https://cab‑lab‑princeton.github.io/nefty/

Authors:Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan
Title: V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
Abstract:
Generating music that temporally aligns with video events is challenging for existing text‑to‑music models, which lack fine‑grained temporal control. We introduce V2M‑Zero, a zero‑pair video‑to‑music generation approach that outputs time‑aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra‑modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine‑tune a text‑to‑music model on music‑event curves, then substitute video‑event curves at inference without cross‑modal training or paired data. Across OES‑Pub, MovieGenBench‑Music, and AIST++, V2M‑Zero achieves substantial gains over paired‑data baselines: 5‑21% higher audio quality, 13‑15% better semantic alignment, 21‑52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd‑source subjective listening test. Overall, our results validate that temporal alignment through within‑modality features, rather than paired cross‑modal supervision, is effective for video‑to‑music generation. Results are available at https://genjib.github.io/v2m_zero/

Authors:Konrad Szafer, Marek Kraft, Dominik Belter
Title: Pointy - A Lightweight Transformer for Point Cloud Foundation Models
Abstract:
Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer‑based point cloud architecture. In contrast to the heavy reliance on cross‑modal supervision, our model is trained only on 39k point clouds ‑ yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state‑of‑the‑art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer‑free architectures. Our results show that simple backbones can deliver competitive results to more complex or data‑rich strategies. The implementation, including code, pre‑trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.

Authors:Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary
Title: Ranking Reasoning LLMs under Test-Time Scaling
Abstract:
Test‑time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test‑time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired‑comparison models, item response theory (IRT) models, voting rules, and graph‑ and spectral‑based methods. Across 20 reasoning models on four Olympiad‑style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to N=80 trials), most full‑trial rankings agree closely with the Bayesian gold standard \mathrmBayes_\mathcalU@80 (mean Kendall's τ_b = 0.93‑‑0.95), and 19‑‑34 methods recover exactly the same ordering. In the single‑trial regime, the best methods reach τ_b \approx 0.86. Using greedy decoding as an empirical prior (\mathrmBayes_\mathbfR_0@N) reduces variance at N=1 by 16‑‑52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high‑ and low‑budget test‑time scaling. We release Scorio as an open‑source library at https://github.com/mohsenhariri/scorio.

Authors:Rajdeep Pathak, Sayantee Jana
Title: Quantifying Membership Disclosure Risk for Tabular Synthetic Data Using Kernel Density Estimators
Abstract:
The use of synthetic data has become increasingly popular as a privacy‑preserving alternative to sharing real datasets, especially in sensitive domains such as healthcare, finance, and demography. However, the privacy assurances of synthetic data are not absolute, and remain susceptible to membership inference attacks (MIAs), where adversaries aim to determine whether a specific individual was present in the dataset used to train the generator. In this work, we propose a practical and effective method to quantify membership disclosure risk in tabular synthetic datasets using kernel density estimators (KDEs). Our KDE‑based approach models the distribution of nearest‑neighbour distances between synthetic data and the training records, allowing probabilistic inference of membership and enabling robust evaluation via ROC curves. We propose two attack models: a 'True Distribution Attack', which assumes privileged access to training data, and a more realistic, implementable 'Realistic Attack' that uses auxiliary data without true membership labels. Empirical evaluations across four real‑world datasets and six synthetic data generators demonstrate that our method consistently achieves higher F1 scores and sharper risk characterization than a prior baseline approach, without requiring computationally expensive shadow models. The proposed method provides a practical framework and metric for quantifying membership disclosure risk in synthetic data, which enables data custodians to conduct a post‑generation risk assessment prior to releasing their synthetic datasets for downstream use. The datasets and codes for this study are available at https://github.com/PyCoder913/MIA‑KDE.

Authors:Zegu Zhang, Jian Zhang
Title: Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors
Abstract:
Variational autoencoders (VAEs) frequently suffer from posterior collapse, where latent variables become uninformative and the approximate posterior degenerates to the prior. Recent work has characterized this phenomenon as a phase transition governed by the spectral properties of the data covariance matrix. In this paper, we propose a fundamentally different approach: instead of avoiding collapse through architectural constraints or hyperparameter tuning, we eliminate the possibility of collapse altogether by leveraging the multiplicity of Gaussian mixture model (GMM) clusterings. We introduce Historical Consensus Training, an iterative selection procedure that progressively refines a set of candidate GMM priors through alternating optimization and selection. The key insight is that models trained to satisfy multiple distinct clustering constraints develop a historical barrier ‑‑ a region in parameter space that remains stable even when subsequently trained with a single objective. We prove that this barrier excludes the collapsed solution, and demonstrate through extensive experiments on synthetic and real‑world datasets that our method achieves non‑collapsed representations regardless of decoder variance or regularization strength. Our approach requires no explicit stability conditions (e.g., σ^\prime 2 < λ_\max) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/historical‑consensus‑vae.

Authors:Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon
Title: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation
Abstract:
Transformer‑based large language models (LLMs) rely on key‑value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long‑context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future", in which a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real‑world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter‑efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long‑context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines in various long‑context understanding tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time‑to‑first‑token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.

Authors:Xinran Xu, Xiuyi Fan
Title: CUPID: A Plug-in Framework for Joint Aleatoric and Epistemic Uncertainty Estimation with a Single Model
Abstract:
Accurate estimation of uncertainty in deep learning is critical for deploying models in high‑stakes domains such as medical diagnosis and autonomous decision‑making, where overconfident predictions can lead to harmful outcomes. In practice, understanding the reason behind a model's uncertainty and the type of uncertainty it represents can support risk‑aware decisions, enhance user trust, and guide additional data collection. However, many existing methods only address a single type of uncertainty or require modifications and retraining of the base model, making them difficult to adopt in real‑world systems. We introduce CUPID (Comprehensive Uncertainty Plug‑in estImation moDel), a general‑purpose module that jointly estimates aleatoric and epistemic uncertainty without modifying or retraining the base model. CUPID can be flexibly inserted into any layer of a pretrained network. It models aleatoric uncertainty through a learned Bayesian identity mapping and captures epistemic uncertainty by analyzing the model's internal responses to structured perturbations. We evaluate CUPID across a range of tasks, including classification, regression, and out‑of‑distribution detection. The results show that it consistently delivers competitive performance while offering layer‑wise insights into the origins of uncertainty. By making uncertainty estimation modular, interpretable, and model‑agnostic, CUPID supports more transparent and trustworthy AI. Related code and data are available at https://github.com/a‑Fomalhaut‑a/CUPID.

Authors:Changyi Xiao, Caijun Xu, Yixin Cao
Title: Reinforcement Learning with Conditional Expectation Reward
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule‑based verifiers can be constructed. However, the reliance on handcrafted, domain‑specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free‑form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule‑based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.

Authors:Yuanbo Hou, Yanru Wu, Qiaoqiao Ren, Shengchen Li, Stephen Roberts, Dick Botteldooren
Title: Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context
Abstract:
Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio‑only recognition problem. This formulation leaves a persistent drawback in multi‑label audio tagging (AT): acoustic similarity can make certain events difficult to separate from waveforms alone. In such cases, disambiguating cues often lie outside the waveform. Geospatial semantic context (GSC), derived from geographic information system data, e.g., points of interest (POI), provides location‑tied environmental priors that can help reduce this ambiguity. A systematic study of this direction is enabled through the proposed geospatial audio tagging (Geo‑AT) task, which conditions multi‑label sound event tagging on GSC alongside audio. To benchmark Geo‑AT, Geo‑ATBench is introduced as a polyphonic audio benchmark with geographical annotations, containing 10.71 hours of audio across 28 event categories; each clip is paired with a GSC representation from 11 semantic context categories. GeoFusion‑AT is proposed as a unified geo‑audio fusion framework that evaluates feature‑, representation‑, and decision‑level fusion on representative audio backbones, with audio‑ and GSC‑only baselines. Results show that incorporating GSC improves AT performance, especially on acoustically confounded labels, indicating geospatial semantics provide effective priors beyond audio alone. A crowdsourced listening study with 10 participants on 579 samples shows that there is no significant difference in performance between models on Geo‑ATBench labels and aggregated human labels, supporting Geo‑ATBench as a human‑aligned benchmark. The Geo‑AT task, benchmark Geo‑ATBench, and reproducible geo‑audio fusion framework GeoFusion‑AT provide a foundation for studying AT with geospatial semantic context within the CASA community. Dataset, code, models are on homepage (https://github.com/WuYanru2002/Geo‑ATBench).

Authors:Ivan Bioli, Mikel Mendibe Abarrategi
Title: Self-Scaled Broyden Family of Quasi-Newton Methods in JAX
Abstract:
We present a JAX implementation of the Self‑Scaled Broyden family of quasi‑Newton methods, fully compatible with JAX and building on the Optimistix~\citerader_optimistix_2024 optimisation library. The implementation includes BFGS, DFP, Broyden and their Self‑Scaled variants(SSBFGS, SSDFP, SSBroyden), together with a Zoom line search satisfying the strong Wolfe conditions. This is a short technical note, not a research paper, as it does not claim any novel contribution; its purpose is to document the implementation and ease the adoption of these optimisers within the JAX community. The code is available at https://github.com/IvanBioli/ssbroyden_optimistix.git.

Authors:Simon D. Nguyen, Troy Russo, Kentaro Hoffman, Tyler H. McCormick
Title: Adaptive Active Learning for Regression via Reinforcement Learning
Abstract:
Active learning for regression reduces labeling costs by selecting the most informative samples. Improved Greedy Sampling is a prominent method that balances feature‑space diversity and output‑space uncertainty using a static, multiplicative rule. We propose Weighted improved Greedy Sampling (WiGS), which replaces this framework with a dynamic, additive criterion. We formulate weight selection as a reinforcement learning problem, enabling an agent to adapt the exploration‑investigation balance throughout learning. Experiments on 18 benchmark datasets and a synthetic environment show WiGS outperforms iGS and other baseline methods in both accuracy and labeling efficiency, particularly in domains with irregular data density where the baseline's multiplicative rule ignores high‑error samples in dense regions.

Authors:Tongcheng Zhang, Zhanpeng Zhou, Mingze Wang, Andi Han, Wei Huang, Taiji Suzuki, Junchi Yan
Title: On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD
Abstract:
One crucial factor behind the success of deep learning lies in the implicit bias induced by noise inherent in gradient‑based training algorithms. Motivated by empirical observations that training with noisy labels improves model generalization, we delve into the underlying mechanisms behind stochastic gradient descent (SGD) with label noise. Focusing on a two‑layer over‑parameterized linear network, we analyze the learning dynamics of label noise SGD, unveiling a two‑phase learning behavior. In \emphPhase I, the magnitudes of model weights progressively diminish, and the model escapes the lazy regime; enters the rich regime. In \emphPhase II, the alignment between model weights and the ground‑truth interpolator increases, and the model eventually converges. Our analysis highlights the critical role of label noise in driving the transition from the lazy to the rich regime and minimally explains its empirical success. Furthermore, we extend these insights to Sharpness‑Aware Minimization (SAM), showing that the principles governing label noise SGD also apply to broader optimization algorithms. Extensive experiments, conducted under both synthetic and real‑world setups, strongly support our theory. Our code is released at https://github.com/a‑usually/Label‑Noise‑SGD.

Authors:Chen-Chen Zong, Sheng-Jun Huang
Title: Federated Active Learning Under Extreme Non-IID and Global Class Imbalance
Abstract:
Federated active learning (FAL) seeks to reduce annotation cost under privacy constraints, yet its effectiveness degrades in realistic settings with severe global class imbalance and highly heterogeneous clients. We conduct a systematic study of query‑model selection in FAL and uncover a central insight: the model that achieves more class‑balanced sampling, especially for minority classes, consistently leads to better final performance. Moreover, global‑model querying is beneficial only when the global distribution is highly imbalanced and client data are relatively homogeneous; otherwise, the local model is preferable. Based on these findings, we propose FairFAL, an adaptive class‑fair FAL framework. FairFAL (1) infers global imbalance and local‑global divergence via lightweight prediction discrepancy, enabling adaptive selection between global and local query models; (2) performs prototype‑guided pseudo‑labeling using global features to promote class‑aware querying; and (3) applies a two‑stage uncertainty‑diversity balanced sampling strategy with k‑center refinement. Experiments on five benchmarks show that FairFAL consistently outperforms state‑of‑the‑art approaches under challenging long‑tailed and non‑IID settings. The code is available at https://github.com/chenchenzong/FairFAL.

Authors:Zhanyi Sun, Shuran Song
Title: From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning
Abstract:
We introduce Distribution Contractive Reinforcement Learning (DICE‑RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE‑RL turns a pretrained behavior prior into a high‑performing "pro" policy by amplifying high‑success behaviors from online feedback. We pretrain a diffusion‑ or flow‑based policy for broad behavioral coverage, then finetune it with a stable, sample‑efficient residual off‑policy RL framework that combines selective behavior regularization with value‑guided action selection. Extensive experiments and analyses show that DICE‑RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long‑horizon manipulation skills directly from high‑dimensional pixel inputs, both in simulation and on a real robot. Project website: https://zhanyisun.github.io/dice.rl.2026/.

Authors:Davide Tugnoli, Andrea De Lorenzo, Marco Virgolin, Giovanni Cinà
Title: Improving TabPFN's Synthetic Data Generation by Integrating Causal Structure
Abstract:
Synthetic tabular data generation addresses data scarcity and privacy constraints in a variety of domains. Tabular Prior‑Data Fitted Network (TabPFN), a recent foundation model for tabular data, has been shown capable of generating high‑quality synthetic tabular data. However, TabPFN is autoregressive: features are generated sequentially by conditioning on the previous ones, depending on the order in which they appear in the input data. We demonstrate that when the feature order conflicts with causal structure, the model produces spurious correlations that impair its ability to generate synthetic data and preserve causal effects. We address this limitation by integrating causal structure into TabPFN's generation process through two complementary approaches: Directed Acyclic Graph (DAG)‑aware conditioning, which samples each variable given its causal parents, and a Completed Partially Directed Acyclic Graph (CPDAG)‑based strategy for scenarios with partial causal knowledge. We evaluate these approaches on controlled benchmarks and six CSuite datasets, assessing structural fidelity, distributional alignment, privacy preservation, and Average Treatment Effect (ATE) preservation. Across most settings, DAG‑aware conditioning improves the quality and stability of synthetic data relative to vanilla TabPFN. The CPDAG‑based strategy shows moderate improvements, with effectiveness depending on the number of oriented edges. These results indicate that injecting causal structure into autoregressive generation enhances the reliability of synthetic tabular data.

Authors:Xiaoyan Zhang, Jiangpeng He
Title: One Adapter for All: Towards Unified Representation in Step-Imbalanced Class-Incremental Learning
Abstract:
Class‑incremental learning (CIL) aims to acquire new classes over time while retaining prior knowledge, yet most setups and methods assume balanced task streams. In practice, the number of classes per task often varies significantly. We refer to this as step imbalance, where large tasks that contain more classes dominate learning and small tasks inject unstable updates. Existing CIL methods assume balanced tasks and therefore treat all tasks uniformly, producing imbalanced updates that degrade overall learning performance. To address this challenge, we propose One‑A, a unified and imbalance‑aware framework that incrementally merges task updates into a single adapter, maintaining constant inference cost. One‑A performs asymmetric subspace alignment to preserve dominant subspaces learned from large tasks while constraining low‑information updates within them. An information‑adaptive weighting balances the contribution between base and new adapters, and a directional gating mechanism selectively fuses updates along each singular direction, maintaining stability in head directions and plasticity in tail ones. Across multiple benchmarks and step‑imbalanced streams, One‑A achieves competitive accuracy with significantly low inference overhead, showing that a single, asymmetrically fused adapter can remain both adaptive to dynamic task sizes and efficient at deployment.

Authors:Ortal Reshef, Ofer Glassman, Or Zuk, Yariv Aizenbud, Boaz Nadler, Ariel Jaffe
Title: SDSR: A Spectral Divide-and-Conquer Approach for Species Tree Reconstruction
Abstract:
Recovering a tree that represents the evolutionary history of a group of species is a key task in phylogenetics. Performing this task using sequence data from multiple genetic markers poses two key challenges. The first is the discordance between the evolutionary history of individual genes and that of the species. The second challenge is computational, as contemporary studies involve thousands of species. Here we present SDSR, a scalable divide‑and‑conquer approach for species tree reconstruction based on spectral graph theory. The algorithm recursively partitions the species into subsets until their sizes are below a given threshold. The trees of these subsets are reconstructed by a user‑chosen species tree algorithm. Finally, these subtrees are merged to form the full tree. On the theoretical front, we derive recovery guarantees for SDSR, under the multispecies coalescent (MSC) model. We also perform a runtime complexity analysis. We show that SDSR, when combined with a species tree reconstruction algorithm as a subroutine, yields substantial runtime savings as compared to applying the same algorithm on the full data. Empirically, we evaluate SDSR on synthetic benchmark datasets with incomplete lineage sorting and horizontal gene transfer. In accordance with our theoretical analysis, the simulations show that combining SDSR with common species tree methods, such as CA‑ML or ASTRAL, yields up to 10‑fold faster runtimes. In addition, SDSR achieves a comparable tree reconstruction accuracy to that obtained by applying these methods on the full data.

Authors:Chujie Chang, Shoko Miyauchi, Ken'ichi Morooka, Ryo Kurazume, Oscar Martinez Mozos
Title: FusionNet: a frame interpolation network for 4D heart models
Abstract:
Cardiac magnetic resonance (CMR) imaging is widely used to visualise cardiac motion and diagnose heart disease. However, standard CMR imaging requires patients to lie still in a confined space inside a loud machine for 40‑60 min, which increases patient discomfort. In addition, shorter scan times decrease either or both the temporal and spatial resolutions of cardiac motion, and thus, the diagnostic accuracy of the procedure. Of these, we focus on reduced temporal resolution and propose a neural network called FusionNet to obtain four‑dimensional (4D) cardiac motion with high temporal resolution from CMR images captured in a short period of time. The model estimates intermediate 3D heart shapes based on adjacent shapes. The results of an experimental evaluation of the proposed FusionNet model showed that it achieved a performance of over 0.897 in terms of the Dice coefficient, confirming that it can recover shapes more precisely than existing methods. This code is available at: https://github.com/smiyauchi199/FusionNet.git

Authors:Deyi Li, Zijun Yao, Qi Xu, Muxuan Liang, Lingyao Li, Zijian Xu, Mei Liu
Title: DT-BEHRT: Disease Trajectory-aware Transformer for Interpretable Patient Representation Learning
Abstract:
The growing adoption of electronic health record (EHR) systems has provided unprecedented opportunities for predictive modeling to guide clinical decision making. Structured EHRs contain longitudinal observations of patients across hospital visits, where each visit is represented by a set of medical codes. While sequence‑based, graph‑based, and graph‑enhanced sequence approaches have been developed to capture rich code interactions over time or within the same visits, they often overlook the inherent heterogeneous roles of medical codes arising from distinct clinical characteristics and contexts. To this end, in this study we propose the Disease Trajectory‑aware Transformer for EHR (DT‑BEHRT), a graph‑enhanced sequential architecture that disentangles disease trajectories by explicitly modeling diagnosis‑centric interactions within organ systems and capturing asynchronous progression patterns. To further enhance the representation robustness, we design a tailored pre‑training methodology that combines trajectory‑level code masking with ontology‑informed ancestor prediction, promoting semantic alignment across multiple modeling modules. Extensive experiments on multiple benchmark datasets demonstrate that DT‑BEHRT achieves strong predictive performance and provides interpretable patient representations that align with clinicians' disease‑centered reasoning. The source code is publicly accessible at https://github.com/GatorAIM/DT‑BEHRT.git.

Authors:Sofia Maria Lo Cicero Vaina, Artem Chumachenko, Max Ryabinin
Title: Mashup Learning: Faster Finetuning by Remixing Past Checkpoints
Abstract:
Finetuning on domain‑specific data is a well‑established method for enhancing LLM performance on downstream tasks. Training on each dataset produces a new set of model weights, resulting in a multitude of checkpoints saved in‑house or on open‑source platforms. However, these training artifacts are rarely reused for subsequent experiments despite containing improved model abilities for potentially similar tasks. In this paper, we propose Mashup Learning, a simple method to leverage the outputs of prior training runs to enhance model adaptation to new tasks. Our procedure identifies the most relevant historical checkpoints for a target dataset, aggregates them with model merging, and uses the result as an improved initialization for training. Across 8 standard LLM benchmarks, four models, and two collections of source checkpoints, Mashup Learning consistently improves average downstream accuracy by 0.5‑5 percentage points over training from scratch. It also accelerates convergence, requiring 41‑46% fewer training steps and up to 37% less total wall‑clock time to match from‑scratch accuracy, including all selection and merging overhead.

Authors:Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang
Title: CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process‑wrong but outcome‑correct rollouts can lead to hallucination and answer‑copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross‑trajectory regularization than the original single‑path supervision in RLVR, effectively mitigating step‑level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen‑Applications/CLIPO.

Authors:Yuze Dong, Jinsong Wu
Title: Rethinking Adam for Time Series Forecasting: A Simple Heuristic to Improve Optimization under Distribution Shifts
Abstract:
Time‑series forecasting often faces challenges from non‑stationarity, particularly distributional drift, where the data distribution evolves over time. This dynamic behavior can undermine the effectiveness of adaptive optimizers, such as Adam, which are typically designed for stationary objectives. In this paper, we revisit Adam in the context of non‑stationary forecasting and identify that its second‑order bias correction limits responsiveness to shifting loss landscapes. To address this, we propose TS_Adam, a lightweight variant that removes the second‑order correction from the learning rate computation. This simple modification improves adaptability to distributional drift while preserving the optimizer core structure and requiring no additional hyperparameters. TS_Adam integrates easily into existing models and consistently improves performance across long‑ and short‑term forecasting tasks. On the ETT datasets with the MICN model, it achieves an average reduction of 12.8% in MSE and 5.7% in MAE compared to Adam. These results underscore the practicality and versatility of TS_Adam as an effective optimization strategy for real‑world forecasting scenarios involving non‑stationary data. Code is available at: https://github.com/DD‑459‑1/TS_Adam.

Authors:Xiaolong Han, Zehong Wang, Bo Zhao, Binchi Zhang, Jundong Li, Damian Borth, Rose Yu, Haggai Maron, Yanfang Ye, Lu Yin, Ferrante Neri
Title: A Survey of Weight Space Learning: Understanding, Representation, and Generation
Abstract:
Neural network weights are typically viewed as the end product of training, while most deep learning research focuses on data, features, and architectures. However, recent advances show that the set of all possible weight values (weight space) itself contains rich structure: pretrained models form organized distributions, exhibit symmetries, and can be embedded, compared, or even generated. Understanding such structures has tremendous impact on how neural networks are analyzed and compared, and on how knowledge is transferred across models, beyond individual training instances. This emerging research direction, which we refer to as Weight Space Learning (WSL), treats neural weights as a meaningful domain for analysis and modeling. This survey provides the first unified taxonomy of WSL. We categorize existing methods into three core dimensions: Weight Space Understanding (WSU), which studies the geometry and symmetries of weights; Weight Space Representation (WSR), which learns embeddings over model weights; and Weight Space Generation (WSG), which synthesizes new weights through hypernetworks or generative models. We further show how these developments enable practical applications, including model retrieval, continual and federated learning, neural architecture search, and data‑free reconstruction. By consolidating fragmented progress under a coherent framework, this survey highlights weight space as a learnable, structured domain with growing impact across model analysis, transferring, and weight generation. We release an accompanying resource at https://github.com/Zehong‑Wang/Awesome‑Weight‑Space‑Learning.

Authors:Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu
Title: KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
Abstract:
Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM‑based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial‑and‑error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge‑driven and aware of task trajectories. Specifically, we present KernelSkill, a multi‑agent framework with a dual‑level memory architecture. KernelSkill operates by coordinating agents with long‑term memory of reusable expert skills and short‑term memory to prevent repetitive backtracking. On KernelBench Levels 1‑3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at https://github.com/0satan0/KernelMem/.

Authors:Ammar Daskin
Title: Mitigating Frequency Learning Bias in Quantum Models via Multi-Stage Residual Learning
Abstract:
Quantum machine learning models based on parameterized circuits can be viewed as Fourier series approximators. However, they often struggle to learn functions with multiple frequency components, particularly high‑frequency or non‑dominant ones; a phenomenon we term the quantum Fourier parameterization bias. Inspired by recent advances in classical Fourier neural operators (FNOs), we adapt the multi‑stage residual learning idea to the quantum domain, iteratively training additional quantum modules on the residuals of previous stages. We evaluate our method on a synthetic benchmark composed of spatially localized frequency components with diverse envelope shapes (Gaussian, Lorentzian, triangular). Systematic experiments show that the number of qubits, the encoding scheme, and residual learning are all crucial for resolving multiple frequencies; residual learning alone can improve test MSE significantly over a single‑stage baseline trained for the same total number of epochs. Our work provides a practical framework for enhancing the spectral expressivity of quantum models and offers new insights into their frequency‑learning behavior.

Authors:Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang
Title: HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
Abstract:
Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy‑tailed weight spectra and over‑emphasizes the training along noise‑dominated directions. Motivated by the Heavy‑Tailed Self‑Regularization (HT‑SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier‑tailed updates and inducing heavier‑tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state‑of‑the‑art baselines and can also serve as a plug‑in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to 0.98 compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten‑q norm constraint and provide convergence analysis in smooth non‑convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.

Authors:Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal
Title: Training Language Models via Neural Cellular Automata
Abstract:
Pre‑training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre‑training has problems: high‑quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non‑linguistic data for pre‑pre‑training LLMs‑‑training on synthetic‑then‑natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre‑pre‑training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre‑pre‑training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench‑Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre‑training.

Authors:Yunzhou Song, Long Le, Yong-Hyun Park, Jie Wang, Junyao Shi, Lingjie Liu, Jiatao Gu, Eric Eaton, Dinesh Jayaraman, Kostas Daniilidis
Title: OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies
Abstract:
Vision‑language‑action(VLA) models have shown great promise as generalist policies for a large range of relatively simple tasks. However, they demonstrate limited performance on more complex tasks, such as those requiring complex spatial or semantic understanding, manipulation in clutter, or precise manipulation. We propose OMNIGUIDE, a flexible framework that improves VLA performance on such tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic‑reasoning VLMs, and human pose models. We show how many kinds of guidance can be naturally expressed as differentiable energy functions with task‑specific attractors and repellers located in 3D space, that influence the sampling of VLA actions. In this way, OMNIGUIDE enables guidance sources with complementary task‑relevant strengths to improve a VLA model's performance on challenging tasks. Extensive experiments in both simulation and real‑world environments, across diverse sources of guidance, demonstrate that OMNIGUIDE enhances the performance of state‑of‑the‑art generalist policies (e.g., π_0.5, GR00T N1.6) significantly across success and safety rates. Critically, our unified framework matches or surpasses the performance of prior methods designed to incorporate specific sources of guidance into VLA policies. Project Page: \hrefhttps://omniguide.github.io/this \; url

Authors:Eric Roginek, Jingyan Xu, D. Frank. Hsu
Title: InFusionLayer: a CFA-based ensemble tool to generate new classifiers for learning and modeling
Abstract:
Ensemble learning is a well established body of methods for machine learning to enhance predictive performance by combining multiple algorithms/models. Combinatorial Fusion Analysis (CFA) has provided method and practice for combining multiple scoring systems, using rank‑score characteristic (RSC) function and cognitive diversity (CD), including ensemble method and model fusion. However, there is no general‑purpose Python tool available that incorporate these techniques. In this paper we introduce \textttInFusionLayer, a machine learning architecture inspired by CFA at the system fusion level that uses a moderate set of base models to optimize unsupervised and supervised learning multiclassification problems. We demonstrate \textttInFusionLayer's ease of use for PyTorch, TensorFlow, and Scikit‑learn workflows by validating its performance on various computer vision datasets. Our results highlight the practical advantages of incorporating distinctive features of RSC function and CD, paving the way for more sophisticated ensemble learning applications in machine learning. We open‑sourced our code to encourage continuing development and community accessibility to leverage CFA on github: https://github.com/ewroginek/Infusion

Authors:David Gringras
Title: Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety
Abstract:
Safety benchmarks evaluate language models in isolation, typically using multiple‑choice format; production deployments wrap these models in agentic scaffolds that restructure inputs through reasoning traces, critic agents, and delegation pipelines. We report one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), combining pre‑registration, assessor blinding, equivalence testing, and specification curve analysis. Map‑reduce scaffolding degrades measured safety (NNH = 14), yet two of three scaffold architectures preserve safety within practically meaningful margins. Investigating the map‑reduce degradation revealed a deeper measurement problem: switching from multiple‑choice to open‑ended format on identical items shifts safety scores by 5‑20 percentage points, larger than any scaffold effect. Within‑format scaffold comparisons are consistent with practical equivalence under our pre‑registered +/‑2 pp TOST margin, isolating evaluation format rather than scaffold architecture as the operative variable. Model x scaffold interactions span 35 pp in opposing directions (one model degrades by ‑16.8 pp on sycophancy under map‑reduce while another improves by +18.8 pp on the same benchmark), ruling out universal claims about scaffold safety. A generalisability analysis yields G = 0.000: model safety rankings reverse so completely across benchmarks that no composite safety index achieves non‑zero reliability, making per‑model, per‑configuration testing a necessary minimum standard. We release all code, data, and prompts as ScaffoldSafety.

Authors:Shubham Kumar Singh
Title: HTM-EAR: Importance-Preserving Tiered Memory with Hybrid Routing under Saturation
Abstract:
Memory constraints in long‑running agents require structured management of accumulated facts while preserving essential information under bounded context limits. We introduce HTM‑EAR, a hierarchical tiered memory substrate that integrates HNSW‑based working memory (L1) with archival storage (L2), combining importance‑aware eviction and hybrid routing. When L1 reaches capacity, items are evicted using a weighted score of importance and usage. Queries are first resolved in L1; if similarity or entity coverage is insufficient, retrieval falls back to L2, and candidates are re‑ranked using a cross‑encoder. We evaluate the system under sustained saturation (15,000 facts; L1 capacity 500; L2 capacity 5000) using synthetic streams across five random seeds and real BGL system logs. Ablation studies compare the full system against variants without cross‑encoder re‑ranking, without routing gates, with LRU eviction, and an oracle with unbounded memory. Under saturation, the full model preserves active‑query precision (MRR = 1.000) while enabling controlled forgetting of stale history, approaching oracle active performance (0.997 +/‑ 0.003). In contrast, LRU minimizes latency (21.1 ms) but permanently evicts 2416 essential facts. On BGL logs, the full system achieves MRR 0.336, close to the oracle (0.370), while LRU drops to 0.069. Code is publicly available at: https://github.com/shubham‑61291/HTM‑EAR

Authors:Izzat Alsmadi, Anas Alsobeh
Title: TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment
Abstract:
This paper presents TAMUSA‑Chat, a research‑oriented framework for building domain‑adapted large language model conversational systems. The work addresses critical challenges in adapting general‑purpose foundation models to institutional contexts through supervised fine‑tuning, retrieval‑augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper‑parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine‑tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality‑cost trade‑offs. The publicly available codebase at https://github.com/alsmadi/TAMUSA_LLM_Based_Chat_app supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.

Authors:Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye
Title: MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios
Abstract:
Mixture‑of‑Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low‑information nature of autoregressive expert activation. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator, but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. Hence, we introduce MoE‑SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE‑SpAc achieves a 42% improvement in TPS over the SOTA SD‑based baseline, and an average 4.04x speedup over all standard baselines. Code is available at https://github.com/lshAlgorithm/MoE‑SpAc .

Authors:Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano
Title: From Data Statistics to Feature Geometry: How Correlations Shape Superposition
Abstract:
A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over‑complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non‑linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag‑of‑Words Superposition (BOWS), a controlled setting to encode binary bag‑of‑words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co‑activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition. Code for this paper can be found at https://github.com/LucasPrietoAl/correlations‑feature‑geometry.

Authors:Fredrik K. Gustafsson, Xiao Gu, Mattia Carletti, Patitapaban Palo, David W. Eyre, David A. Clifton
Title: SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
Abstract:
Recent biosignal foundation models (FMs) have demonstrated promising performance across diverse clinical prediction tasks, yet systematic evaluation on long‑duration multimodal data remains limited. We introduce SignalMC‑MED, a benchmark for evaluating biosignal FMs on synchronized single‑lead electrocardiogram (ECG) and photoplethysmogram (PPG) data. Derived from the MC‑MED dataset, SignalMC‑MED comprises 22,256 visits with 10‑minute overlapping ECG and PPG signals, and includes 20 clinically relevant tasks spanning prediction of demographics, emergency department disposition, laboratory value regression, and detection of prior ICD‑10 diagnoses. Using this benchmark, we perform a systematic evaluation of representative time‑series and biosignal FMs across ECG‑only, PPG‑only, and ECG + PPG settings. We find that domain‑specific biosignal FMs consistently outperform general time‑series models, and that multimodal ECG + PPG fusion yields robust improvements over unimodal inputs. Moreover, using the full 10‑minute signal consistently outperforms shorter segments, and larger model variants do not reliably outperform smaller ones. Hand‑crafted ECG domain features provide a strong baseline and offer complementary value when combined with learned FM representations. Together, these results establish SignalMC‑MED as a standardized benchmark and provide practical guidance for evaluating and deploying biosignal FMs.

Authors:Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause
Title: ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low‑resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine‑tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high‑quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one‑sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.

Authors:Boya Zhang, Shuaijie Yin, Huiwen Zhu, Xing He
Title: FreqCycle: A Multi-Scale Time-Frequency Analysis Method for Time Series Forecasting
Abstract:
Mining time‑frequency features is critical for time series forecasting. Existing research has predominantly focused on modeling low‑frequency patterns, where most time series energy is concentrated. The overlooking of mid to high frequency continues to limit further performance gains in deep learning models. We propose FreqCycle, a novel framework integrating: (i) a Filter‑Enhanced Cycle Forecasting (FECF) module to extract low‑frequency features by explicitly learning shared periodic patterns in the time domain, and (ii) a Segmented Frequency‑domain Pattern Learning (SFPL) module to enhance mid to high frequency energy proportion via learnable filters and adaptive weighting. Furthermore, time series data often exhibit coupled multi‑periodicity, such as intertwined weekly and daily cycles. To address coupled multi‑periodicity as well as long lookback window challenges, we extend FreqCycle hierarchically into MFreqCycle, which decouples nested periodic features through cross‑scale interactions. Extensive experiments on seven diverse domain benchmarks demonstrate that FreqCycle achieves state‑of‑the‑art accuracy while maintaining faster inference speeds, striking an optimal balance between performance and efficiency.

Authors:Elisabeth Sommer James, Asger Hobolth, Marta Pelizzola
Title: MM-algorithms for traditional and convex NMF with Tweedie and Negative Binomial cost functions and empirical evaluation
Abstract:
Non‑negative matrix factorisation (NMF) is a widely used tool for unsupervised learning and feature extraction, with applications ranging from genomics to text analysis and signal processing. Standard formulations of NMF are typically derived under Gaussian or Poisson noise assumptions, which may be inadequate for data exhibiting overdispersion or other complex mean‑variance relationships. In this paper, we develop a unified framework for both traditional and convex NMF under a broad class of distributional assumptions, including Negative Binomial and Tweedie models, where the connection between the Tweedie and the β‑divergence is also highlighted. Using a Majorize‑Minimisation approach, we derive multiplicative update rules for all considered models, and novel updates for convex NMF with Poisson and Negative Binomial cost functions. We provide a unified implementation of all considered models, including the first implementations of several convex NMF models. Empirical evaluations on mutational and word count data demonstrate that the choice of noise model critically affects model fit and feature recovery, and that convex NMF can provide an efficient and robust alternative to traditional NMF in scenarios where the number of classes is large. The code for our proposed updates is available in the R package nmfgenr and can be found at https://github.com/MartaPelizzola/nmfgenr.

Authors:Cosmo Santoni
Title: Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference
Abstract:
State‑space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. We show that Mamba‑2's state space duality algorithm ‑‑ diagonal state structure, chunkable recurrence, and einsum‑dominated compute with static control flow ‑‑ maps cleanly onto what XLA's fusion and tiling passes actually optimise, making custom kernels optional rather than required. We implement the full inference path (prefill, cached autoregressive decoding) as shaped standard primitives under XLA, without hand‑written kernels, and realise the architecture's theoretical O(1) state management as a compiled on‑device cache requiring no host synchronisation during generation. The implementation runs unmodified on CPU, NVIDIA GPU, and Google Cloud TPU from a single JAX source. On TPU v6e across five model scales (130M‑‑2.7B parameters), XLA‑generated code reaches approximately 140 TFLOPS on single‑stream prefill (15% MFU) and up to 64% bandwidth utilisation on decode. Greedy decoding matches the PyTorch/CUDA reference token‑for‑token across 64 steps, with hidden‑state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence satisfying the same structural conditions, on any platform with a mature XLA backend. The implementation is publicly available at https://github.com/CosmoNaught/mamba2‑jax and merged into the Bonsai JAX model library.

Authors:Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji
Title: Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation
Abstract:
Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine‑tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and inefficient. To address this, we introduce a parameter‑ and data‑efficient framework named Efficient Draft Adaptation, abbreviated as EDA, for efficiently adapting draft models. EDA introduces three innovations: (1) a decoupled architecture that utilizes shared and private components to model the shared and target‑specific output distributions separately, enabling parameter‑efficient adaptation by updating only the lightweight private component;(2) a data regeneration strategy that utilizes the fine‑tuned target model to regenerate training data, thereby improving the alignment between training and speculative decoding, leading to higher average acceptance length;(3) a sample selection mechanism that prioritizes high‑value data for efficient adaptation. Our experiments show that EDA effectively restores speculative performance on fine‑tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at https://github.com/Lyn‑Lucy/Efficient‑Draft‑Adaptation.

Authors:Taesung Kwon, Lorenzo Bianchi, Lennart Wittke, Felix Watine, Fabio Carrara, Jong Chul Ye, Romann Weber, Vinicius Azevedo
Title: Reviving ConvNeXt for Efficient Convolutional Diffusion Models
Abstract:
Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness‑‑the attributes that established ConvNets as the efficient vision backbone‑‑have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT‑XL/2, FCDM‑XL achieves competitive performance with 7× and 7.5× fewer training steps at 256×256 and 512×512 resolutions, respectively. Remarkably, FCDM‑XL can be trained on a 4‑GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.

Authors:MoonJeong Park, Seungbeom Lee, Kyungmin Kim, Jaeseung Heo, Seunghyuk Cho, Shouheng Li, Sangdon Park, Dongwoo Kim
Title: Transductive Generalization via Optimal Transport and Its Application to Graph Node Classification
Abstract:
Many existing transductive bounds rely on classical complexity measures that are computationally intractable and often misaligned with empirical behavior. In this work, we establish new representation‑based generalization bounds in a distribution‑free transductive setting, where learned representations are dependent, and test features are accessible during training. We derive global and class‑wise bounds via optimal transport, expressed in terms of Wasserstein distances between encoded feature distributions. We demonstrate that our bounds are efficiently computable and strongly correlate with empirical generalization in graph node classification, improving upon classical complexity measures. Additionally, our analysis reveals how the GNN aggregation process transforms the representation distributions, inducing a trade‑off between intra‑class concentration and inter‑class separation. This yields depth‑dependent characterizations that capture the non‑monotonic relationship between depth and generalization error observed in practice. The code is available at https://github.com/ml‑postech/Transductive‑OT‑Gen‑Bound.

Authors:Runyao Yu, Viviana Kleine, Philipp Gromotka, Thomas Rudolf, Adrian Eisenmann, Gautham Ram Chandra Mouli, Peter Palensky, Jochen L. Cremer
Title: Probabilistic Hysteresis Factor Prediction for Electric Vehicle Batteries with Graphite Anodes Containing Silicon
Abstract:
Batteries with silicon‑graphite‑based anodes, which offer higher energy density and improved charging performance, introduce pronounced voltage hysteresis, making state‑of‑charge (SoC) estimation particularly challenging. Existing approaches to modeling hysteresis rely on exhaustive high‑fidelity tests or focus on conventional graphite‑based lithium‑ion batteries, without considering uncertainty quantification or computational constraints. This work introduces a data‑driven approach for probabilistic hysteresis factor prediction, with a particular emphasis on applications involving silicon‑graphite anode‑based batteries. A data harmonization framework is proposed to standardize heterogeneous driving cycles across varying operating conditions. Statistical learning and deep learning models are applied to assess performance in predicting the hysteresis factor with uncertainties while considering computational efficiency. Extensive experiments are conducted to evaluate the generalizability of the optimal model configuration in unseen vehicle models through retraining, zero‑shot prediction, fine‑tuning, and joint training. By addressing key challenges in SoC estimation, this research facilitates the adoption of advanced battery technologies. A summary page is available at: https://runyao‑yu.github.io/Porsche_Hysteresis_Factor_Prediction/

Authors:Wei Feng, Jingbo Zhang, Qiong Wu, Pingyi Fan, Qiang Fan
Title: PPO-Based Hybrid Optimization for RIS-Assisted Semantic Vehicular Edge Computing
Abstract:
To support latency‑sensitive Internet of Vehicles (IoV) applications amidst dynamic environments and intermittent links, this paper proposes a Reconfigurable Intelligent Surface (RIS)‑aided semantic‑aware Vehicle Edge Computing (VEC) framework. This approach integrates RIS to optimize wireless connectivity and semantic communication to minimize latency by transmitting semantic features. We formulate a comprehensive joint optimization problem by optimizing offloading ratios, the number of semantic symbols, and RIS phase shifts. Considering the problem's high dimensionality and non‑convexity, we propose a two‑tier hybrid scheme that employs Proximal Policy Optimization (PPO) for discrete decision‑making and Linear Programming (LP) for offloading optimization. The simulation results have validated the proposed framework's superiority over existing methods. Specifically, the proposed PPO‑based hybrid optimization scheme reduces the average end‑to‑end latency by approximately 40% to 50% compared to Genetic Algorithm (GA) and Quantum‑behaved Particle Swarm Optimization (QPSO). Moreover, the system demonstrates strong scalability by maintaining low latency even in congested scenarios with up to 30 vehicles.

Authors:Pranav Mantini, Shishir K. Shah
Title: BiCLIP: Domain Canonicalization via Structured Geometric Transformation
Abstract:
Recent advances in vision‑language models (VLMs) have demonstrated remarkable zero‑shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few‑shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross‑modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state‑of‑the‑art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP

Authors:Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri
Title: MASEval: Extending Multi-Agent Evaluation from Models to Systems
Abstract:
The rapid adoption of LLM‑based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model‑centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework‑agnostic library that treats the entire system as the unit of analysis. Through a systematic system‑level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence https://github.com/parameterlab/MASEval.

Authors:Azul Garza, Renée Rosillo, Rodrigo Mendoza-Smith, David Salinas, Andrew Robert Williams, Arjun Ashok, Mononito Goswami, José Martín Juárez
Title: Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting
Abstract:
Recent advances in time‑series forecasting increasingly rely on pre‑trained foundation‑style models. While these models often claim broad generalization, existing evaluation protocols provide limited evidence. Indeed, most current benchmarks use static train‑test splits that can easily lead to contamination as foundation models can inadvertently train on test data or perform model selection using test scores, which can inflate performance. We introduce Impermanent, a live benchmark that evaluates forecasting models under open‑world temporal change by scoring forecasts sequentially over time on continuously updated data streams, enabling the study of temporal robustness, distributional shift, and performance stability rather than one‑off accuracy on a frozen test set. Impermanent is instantiated on GitHub open‑source activity, providing a naturally live and highly non‑stationary dataset shaped by releases, shifting contributor behavior, platform/tooling changes, and external events. We focus on the top 400 repositories by star count and construct time series from issues opened, pull requests opened, push events, and new stargazers, evaluated over a rolling window with daily updates, alongside standardized protocols and leaderboards for reproducible, ongoing comparison. By shifting evaluation from static accuracy to sustained performance, Impermanent takes a concrete step toward assessing when and whether foundation‑level generalization in time‑series forecasting can be meaningfully claimed. Code and a live dashboard are available at https://github.com/TimeCopilot/impermanent and https://impermanent.timecopilot.dev.

Authors:Yehonatan Elisha, Oren Barkan, Noam Koenigstein
Title: Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
Abstract:
Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, typically relying on simple foreground‑background masks, which fail to capture the fine‑grained semantic concepts that define an object (e.g., ``long beak'' and ``wings'' for a ``bird''). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept‑level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class‑relevant concepts are first proposed using an LLM‑based, label‑free method, and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out‑of‑distribution benchmarks demonstrate that our method improves robustness across multiple ViT‑based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept‑guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.

Authors:Michael Kösel, Marcel Schreiber, Michael Ulrich, Claudius Gläser, Klaus Dietmayer
Title: ALOOD: Exploiting Language Representations for LiDAR-based Out-of-Distribution Object Detection
Abstract:
LiDAR‑based 3D object detection plays a critical role for reliable and safe autonomous driving systems. However, existing detectors often produce overly confident predictions for objects not belonging to known categories, posing significant safety risks. This is caused by so‑called out‑of‑distribution (OOD) objects, which were not part of the training data, resulting in incorrect predictions. To address this challenge, we propose ALOOD (Aligned LiDAR representations for Out‑Of‑Distribution Detection), a novel approach that incorporates language representations from a vision‑language model (VLM). By aligning the object features from the object detector to the feature space of the VLM, we can treat the detection of OOD objects as a zero‑shot classification task. We demonstrate competitive performance on the nuScenes OOD benchmark, establishing a novel approach to OOD object detection in LiDAR using language representations. The source code is available at https://github.com/uulm‑mrm/mmood3d.

Authors:Lukas König, Manuel Kuhn, David Kappel, Anand Subramoney
Title: Training event-based neural networks with exact gradients via Differentiable ODE Solving in JAX
Abstract:
Existing frameworks for gradient‑based training of spiking neural networks face a trade‑off: discrete‑time methods using surrogate gradients support arbitrary neuron models but introduce gradient bias and constrain spike‑time resolution, while continuous‑time methods that compute exact gradients require analytical expressions for spike times and state evolution, restricting them to simple neuron types such as Leaky Integrate and Fire (LIF). We introduce the Eventax framework, which resolves this trade‑off by combining differentiable numerical ODE solvers with event‑based spike handling. Built in JAX, our frame‑work uses Diffrax ODE‑solvers to compute gradients that are exact with respect to the forward simulation for any neuron model defined by ODEs . It also provides a simple API where users can specify just the neuron dynamics, spike conditions, and reset rules. Eventax prioritises modelling flexibility, supporting a wide range of neuron models, loss functions, and network architectures, which can be easily extended. We demonstrate Eventax on multiple benchmarks, including Yin‑Yang and MNIST, using diverse neuron models such as Leaky Integrate‑and‑fire (LIF), Quadratic Integrate‑and‑fire (QIF), Exponential integrate‑and‑fire (EIF), Izhikevich and Event‑based Gated Recurrent Unit (EGRU) with both time‑to‑first‑spike and state‑based loss functions, demonstrating its utility for prototyping and testing event‑based architectures trained with exact gradients. We also demonstrate the application of this framework for more complex neuron types by implementing a multi‑compartment neuron that uses a model of dendritic spikes in human layer 2/3 cortical Pyramidal neurons for computation. Code available at https://github.com/efficient‑scalable‑machine‑learning/eventax.

Authors:Divake Kumar, Sina Tayebati, Devashri Naik, Patrick Poggi, Amanda Sofie Rios, Nilesh Ahuja, Amit Ranjan Trivedi
Title: TRIAGE: Type-Routed Interventions via Aleatoric-Epistemic Gated Estimation in Robotic Manipulation and Adaptive Perception -- Don't Treat All Uncertainty the Same
Abstract:
Most uncertainty‑aware robotic systems collapse prediction uncertainty into a single scalar score and use it to trigger uniform corrective responses. This aggregation obscures whether uncertainty arises from corrupted observations or from mismatch between the learned model and the true system dynamics. As a result, corrective actions may be applied to the wrong component of the closed loop, degrading performance relative to leaving the policy unchanged. We introduce a lightweight post hoc framework that decomposes uncertainty into aleatoric and epistemic components and uses these signals to regulate system responses at inference time. Aleatoric uncertainty is estimated from deviations in the observation distribution using a Mahalanobis density model, while epistemic uncertainty is detected using a noise robust forward dynamics ensemble that isolates model mismatch from measurement corruption. The two signals remain empirically near orthogonal during closed loop execution and enable type specific responses. High aleatoric uncertainty triggers observation recovery, while high epistemic uncertainty moderates control actions. The same signals also regulate adaptive perception by guiding model capacity selection during tracking inference. Experiments demonstrate consistent improvements across both control and perception tasks. In robotic manipulation, the decomposed controller improves task success from 59.4% to 80.4% under compound perturbations and outperforms a combined uncertainty baseline by up to 21.0%. In adaptive tracking inference on MOT17, uncertainty‑guided model selection reduces average compute by 58.2% relative to a fixed high capacity detector while preserving detection quality within 0.4%. Code and demo videos are available at https://divake.github.io/uncertainty‑decomposition/.

Authors:Zhongjian Qiao, Jiafei Lyu, Boxiang Lyu, Yao Shu, Siyang Gao, Shuang Qiu
Title: Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting
Abstract:
Model‑based offline reinforcement learning (RL) aims to enhance offline RL with a dynamics model that facilitates policy exploration. However, model exploitation could occur due to inevitable model errors, degrading algorithm performance. Adversarial model learning offers a theoretical framework to mitigate model exploitation by solving a maximin formulation. Within such a paradigm, RAMBO~\citeprigter2022rambo has emerged as a representative and most popular method that provides a practical implementation with model gradient. However, we empirically reveal that severe Q‑value underestimation and gradient explosion can occur in RAMBO with only slight hyperparameter tuning, suggesting that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose RObust value‑aware Model learning with Implicitly differentiable adaptive weighting (ROMI). Instead of updating the dynamics model with model gradient, ROMI introduces a novel robust value‑aware model learning approach. This approach requires the dynamics model to predict future states with values close to the minimum Q‑value within a scale‑adjustable state uncertainty set, enabling controllable conservatism and stable model updates. To further improve out‑of‑distribution (OOD) generalization during multi‑step rollouts, we propose implicitly differentiable adaptive weighting, a bi‑level optimization scheme that adaptively achieves dynamics‑ and value‑aware model learning. Empirical results on D4RL and NeoRL datasets show that ROMI significantly outperforms RAMBO and achieves competitive or superior performance compared to other state‑of‑the‑art methods on datasets where RAMBO typically underperforms. Code is available at https://github.com/zq2r/ROMI.git.

Authors:Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen
Title: SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning
Abstract:
Large reasoning models (LRMs) like OpenAI o1 and DeepSeek‑R1 achieve high accuracy on complex tasks by adopting long chain‑of‑thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over‑compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO‑based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two‑fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU‑RTEAS/SmartThinker.

Authors:Yusong Wang, Chuang Yang, Jiawei Wang, Xiaohang Xu, Jiayi Xu, Dongyuan Li, Chuan Xiao, Renhe Jiang
Title: ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework
Abstract:
Human mobility generation aims to synthesize plausible trajectory data, which is widely used in urban system research. While Large Language Model‑based methods excel at generating routine trajectories, they struggle to capture deviated mobility during large‑scale societal events. This limitation stems from two critical gaps: (1) the absence of event‑annotated mobility datasets for design and evaluation, and (2) the inability of current frameworks to reconcile competitions between users' habitual patterns and event‑imposed constraints when making trajectory decisions. This work addresses these gaps with a twofold contribution. First, we construct the first event‑annotated mobility dataset covering three major events: Typhoon Hagibis, COVID‑19, and the Tokyo 2021 Olympics. Second, we propose ELLMob, a self‑aligned LLM framework that first extracts competing rationales between habitual patterns and event constraints, based on Fuzzy‑Trace Theory, and then iteratively aligns them to generate trajectories that are both habitually grounded and event‑responsive. Extensive experiments show that ELLMob wins state‑of‑the‑art baselines across all events, demonstrating its effectiveness. Our codes and datasets are available at https://github.com/deepkashiwa20/ELLMob.

Authors:Darius Catrina, Christian Bepler, Samuel Sledzieski, Rohit Singh
Title: Reverse Distillation: Consistently Scaling Protein Language Model Representations
Abstract:
Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid‑sized models often outperforming the largest in the family. We introduce Reverse Distillation, a principled framework that decomposes large PLM representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka‑style structure: the first k dimensions of a larger model's embedding are exactly the representation from the smaller model. This ensures that larger reverse‑distilled models consistently outperform smaller ones. A motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly‑shared protein features. Reverse distillation isolates these shared features and orthogonally extracts additional contributions from larger models, preventing interference between the two. On ProteinGym benchmarks, reverse‑distilled ESM‑2 variants outperform their respective baselines at the same embedding dimensionality, with the reverse‑distilled 15 billion parameter model achieving the strongest performance. Our framework is generalizable to any model family where scaling challenges persist. Code and trained models are available at https://github.com/rohitsinghlab/plm_reverse_distillation.

Authors:Najeeb Jebreel, Mona Khalil, David Sánchez, Josep Domingo-Ferrer
Title: Revisiting the LiRA Membership Inference Attack Under Realistic Assumptions
Abstract:
Membership inference attacks (MIAs) have become the standard tool for evaluating privacy leakage in machine learning (ML). Among them, the Likelihood‑Ratio Attack (LiRA) is widely regarded as the state of the art when sufficient shadow models are available. However, prior evaluations have often overstated the effectiveness of LiRA by attacking models overconfident on their training samples, calibrating thresholds on target data, assuming balanced membership priors, and/or overlooking attack reproducibility. We re‑evaluate LiRA under a realistic protocol that (i) trains models using anti‑overfitting (AOF) and transfer learning (TL), when applicable, to reduce overconfidence as in production models; (ii) calibrates decision thresholds using shadow models and data rather than target data; (iii) measures positive predictive value (PPV, or precision) under shadow‑based thresholds and skewed membership priors (pi <= 10%); and (iv) quantifies per‑sample membership reproducibility across different seeds and training variations. We find that AOF significantly weakens LiRA, while TL further reduces attack effectiveness while improving model accuracy. Under shadow‑based thresholds and skewed priors, LiRA's PPV often drops substantially, especially under AOF or AOF+TL. We also find that thresholded vulnerable sets at extremely low FPR show poor reproducibility across runs, while likelihood‑ratio rankings are more stable. These results suggest that LiRA, and likely weaker MIAs, are less effective than previously suggested under realistic conditions, and that reliable privacy auditing requires evaluation protocols that reflect practical training practices, feasible attacker assumptions, and reproducibility considerations. Code is available at https://github.com/najeebjebreel/lira_analysis.

Authors:Ramin Akbari, Milad Afshari, Vishnu Naresh Boddeti
Title: Obliviator Reveals the Cost of Nonlinear Guardedness in Concept Erasure
Abstract:
Concept erasure aims to remove unwanted attributes, such as social or demographic factors, from learned representations, while preserving their task‑relevant utility. While the goal of concept erasure is protection against all adversaries, existing methods remain vulnerable to nonlinear ones. This vulnerability arises from their failure to fully capture the complex, nonlinear statistical dependencies between learned representations and unwanted attributes. Moreover, although the existence of a trade‑off between utility and erasure is expected, its progression during the erasure process, i.e., the cost of erasure, remains unstudied. In this work, we introduce Obliviator, a post‑hoc erasure method designed to fully capture nonlinear statistical dependencies. We formulate erasure from a functional perspective, leading to an optimization problem involving a composition of kernels that lacks a closed‑form solution. Instead of solving this problem in a single shot, we adopt an iterative approach that gradually morphs the feature space to achieve a more utility‑preserving erasure. Unlike prior methods, Obliviator guards unwanted attribute against nonlinear adversaries. Our gradual approach quantifies the cost of nonlinear guardedness and reveals the dynamics between attribute protection and utility‑preservation over the course of erasure. The utility‑erasure trade‑off curves obtained by Obliviator outperform the baselines and demonstrate its strong generalizability: its erasure becomes more utility‑preserving when applied to the better‑disentangled representations learned by more capable models.

Authors:Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari, Mert D. Pesé
Title: SLNet: A Super-Lightweight Geometry-Adaptive Network for 3D Point Cloud Recognition
Abstract:
We present SLNet, a lightweight backbone for 3D point cloud recognition designed to achieve strong performance without the computational cost of many recent attention, graph, and deep MLP based models. The model is built on two simple ideas: NAPE (Nonparametric Adaptive Point Embedding), which captures spatial structure using a combination of Gaussian RBF and cosine bases with input adaptive bandwidth and blending, and GMU (Geometric Modulation Unit), a per channel affine modulator that adds only 2D learnable parameters. These components are used within a four stage hierarchical encoder with FPS+kNN grouping, nonparametric normalization, and shared residual MLPs. In experiments, SLNet shows that a very small model can still remain highly competitive across several 3D recognition tasks. On ModelNet40, SLNet‑S with 0.14M parameters and 0.31 GFLOPs achieves 93.64% overall accuracy, outperforming PointMLP‑elite with 5x fewer parameters, while SLNet‑M with 0.55M parameters and 1.22 GFLOPs reaches 93.92%, exceeding PointMLP with 24x fewer parameters. On ScanObjectNN, SLNet‑M achieves 84.25% overall accuracy within 1.2 percentage points of PointMLP while using 28x fewer parameters. For large scale scene segmentation, SLNet‑T extends the backbone with local Point Transformer attention and reaches 58.2% mIoU on S3DIS Area 5 with only 2.5M parameters, more than 17x fewer than Point Transformer V3. We also introduce NetScore+, which extends NetScore by incorporating latency and peak memory so that efficiency can be evaluated in a more deployment oriented way. Across multiple benchmarks and hardware settings, SLNet delivers a strong overall balance between accuracy and efficiency. Code is available at: https://github.com/m‑saeid/SLNet.

Authors:Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou, Guoliang Li, Yuyu Luo, Changdong Liu, Guorun Chen, Jiang Liao, Fan Wu
Title: Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System
Abstract:
Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built‑in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt‑based approaches tightly couple intent reasoning with dialect syntax, rule‑based translators often degrade native operators into generic constructs, and multi‑dialect fine‑tuning suffers from cross‑dialect interference. In this paper, we present Dial, a knowledge‑grounded framework for dialect‑specific NL2SQL. Dial introduces: (1) a Dialect‑Aware Logical Query Planning module that converts natural language into a dialect‑aware logical query plan via operator‑level intent decomposition and divergence‑aware specification; (2) HINT‑KB, a hierarchical intent‑aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution‑driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS‑NL2SQL, a benchmark covering six major database systems with 2,218 dialect‑specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state‑of‑the‑art baselines. The code is at https://github.com/weAIDB/Dial.

Authors:Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang
Title: Generalization in Online Reinforcement Learning for Mobile Agents
Abstract:
Graphical user interface (GUI)‑based mobile agents automate digital tasks on mobile devices by interpreting natural‑language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision‑language‑model(VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open‑source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld‑Generalization, a benchmark with three increasingly challenging regimes for evaluating zero‑shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution % , and error recovery to support reliable and efficient training. Experiments on AndroidWorld‑Generalization show that RL enables a 7B‑parameter VLM agent to surpass supervised fine‑tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few‑shot adaptation at test‑time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open‑source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure \footnotehttps://github.com/zihuanjiang/AndroidWorld‑Generalization.

Authors:Antonio De Santis, Schrasing Tong, Marco Brambilla, Lalana Kagal
Title: Learning Concept Bottleneck Models from Mechanistic Explanations
Abstract:
Concept Bottleneck Models (CBMs) aim for ante‑hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State‑of‑the‑art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a‑priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black‑box counterpart when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M‑CBM), which builds the bottleneck directly from a black‑box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For fair comparison and leakage control, we also introduce the Number of Contributing Concepts (NCC), a decision‑level sparsity metric that extends the recently proposed NEC metric. Across diverse datasets, we show that M‑CBMs consistently surpass prior CBMs at matched sparsity, while improving concept predictions and providing concise explanations. Our code is available at https://github.com/Antonio‑Dee/M‑CBM.

Authors:Yatharth Sharma
Title: Fast and Flexible Audio Bandwidth Extension via Vocos
Abstract:
We propose a Vocos‑based bandwidth extension model that enhances audio at 8‑48 kHz by generating missing high‑frequency content. Inputs are resampled to 48 kHz and processed by a neural vocoder backbone, enabling a single network to support arbitrary upsampling ratios. A lightweight Linkwitz‑Riley‑inspired refiner merges the original low band with the generated high frequencies via a smooth crossover. On validation, the model achieves competitive log‑spectral distance while running at a real‑time factor of 0.0001 on an NVIDIA A100 GPU and 0.0053 on an 8‑core CPU, demonstrating practical, high‑quality BWE at extreme throughput.

Authors:Abbas Mammadov, So Takao, Bohan Chen, Ricardo Baptista, Morteza Mardani, Yee Whye Teh, Julius Berner
Title: Variational Flow Maps: Make Some Noise for One-Step Conditional Generation
Abstract:
Flow maps enable high‑quality image generation in a single forward pass. However, unlike iterative diffusion models, their lack of an explicit sampling trajectory impedes incorporating external constraints for conditional generation and solving inverse problems. We put forth Variational Flow Maps, a framework for conditional sampling that shifts the perspective of conditioning from "guiding a sampling path", to that of "learning the proper initial noise". Specifically, given an observation, we seek to learn a noise adapter model that outputs a noise distribution, so that after mapping to the data space via flow map, the samples respect the observation and data prior. To this end, we develop a principled variational objective that jointly trains the noise adapter and the flow map, improving noise‑data alignment, such that sampling from complex data posterior is achieved with a simple adapter. Experiments on various inverse problems show that VFMs produce well‑calibrated conditional samples in a single (or few) steps. For ImageNet, VFM attains competitive fidelity while accelerating the sampling by orders of magnitude compared to alternative iterative diffusion/flow models. Code is available at https://github.com/abbasmammadov/VFM

Authors:Andrea Giuseppe Di Francesco, Andrea Rubbi, Pietro Liò
Title: Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation
Abstract:
Predicting how cells respond to genetic perturbations is fundamental to understanding gene function, disease mechanisms, and therapeutic development. While recent deep learning approaches have shown promise in modeling single‑cell perturbation responses, they struggle to generalize across cell types and perturbation contexts due to limited contextual information during generation. We introduce PT‑RAG (Perturbation‑aware Two‑stage Retrieval‑Augmented Generation), a novel framework that extends Retrieval‑Augmented Generation beyond traditional language‑model applications to cellular biology. Unlike standard RAG systems designed for text retrieval with pre‑trained LLMs, perturbation retrieval lacks established similarity metrics and requires learning what constitutes relevant context, making differentiable retrieval essential. PT‑RAG addresses this through a two‑stage pipeline: first, retrieving candidate perturbations K using GenePT embeddings, then adaptively refining the selection through Gumbel‑Softmax discrete sampling conditioned on both the cell state and the input perturbation. This cell‑type‑aware differentiable retrieval enables end‑to‑end optimization of the retrieval objective jointly with generation. On the Replogle‑Nadig single‑gene perturbation dataset, we demonstrate that PT‑RAG outperforms both STATE and vanilla RAG under identical experimental conditions, with the strongest gains in distributional similarity metrics (W_1, W_2). Notably, vanilla RAG's dramatic failure is itself a key finding: it demonstrates that differentiable, cell‑type‑aware retrieval is essential in this domain, and that naive retrieval can actively harm performance. Our results establish retrieval‑augmented generation as a promising paradigm for modelling cellular responses to gene perturbation. The code to reproduce our experiments is available at https://github.com/difra100/PT‑RAG_ICLR.

Authors:Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, Shi-Min Hu
Title: Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts
Abstract:
Optimizing GPU kernels manually is a challenging and time‑consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM‑driven automated optimization methods narrowly focus on machine learning applications, such as PyTorch operator optimization, while overlooking broader domains like sparse matrix operations in scientific computing. Extending to these broader applications brings new challenges for the benchmark and algorithm. Therefore, developing a general‑purpose automated kernel optimization method becomes our primary focus. In this paper, we address the absence of systematic evaluation for multi‑scenario settings by introducing MSKernelBench, which spans multiple scenarios, including fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines, each supporting both FP32 and BF16 precision. Building on this benchmark, we introduce CUDAMaster, a multi‑agent, hardware‑aware system for kernel optimization that leverages profiling information and automatically constructs the full compilation and execution toolchain. Experimental results demonstrate that CUDAMaster achieves significant speedups across most operators, outperforming Astra by about 35%. In several cases, its performance matches or surpasses that of highly optimized, closed‑source libraries such as cuBLAS. A demo showcasing the original and optimized code for each operator is available at https://hanyx2021.github.io/MSKernelBenchDemo/.

Authors:Tao Shi, Liangming Chen, Long Jin, Mengchu Zhou
Title: Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers
Abstract:
In the training of neural networks, adaptive moment estimation (Adam) typically converges fast but exhibits suboptimal generalization performance. A widely accepted explanation for its defect in generalization is that it often tends to converge to sharp minima. To enhance its ability to find flat minima, we propose its new variant named inverse Adam (InvAdam). The key improvement of InvAdam lies in its parameter update mechanism, which is opposite to that of Adam. Specifically, it computes element‑wise multiplication of the first‑order and second‑order moments, while Adam computes the element‑wise division of these two moments. This modification aims to increase the step size of the parameter update when the elements in the second‑order moments are large and vice versa, which helps the parameter escape sharp minima and stay at flat ones. However, InvAdam's update mechanism may face challenges in convergence. To address this challenge, we propose dual Adam (DualAdam), which integrates the update mechanisms of both Adam and InvAdam, ensuring convergence while enhancing generalization performance. Additionally, we introduce the diffusion theory to mathematically demonstrate InvAdam's ability to escape sharp minima. Extensive experiments are conducted on image classification tasks and large language model (LLM) fine‑tuning. The results validate that DualAdam outperforms Adam and its state‑of‑the‑art variants in terms of generalization performance. The code is publicly available at https://github.com/LongJin‑lab/DualAdam.

Authors:Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang
Title: Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
Abstract:
Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown‑Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual‑access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward‑hacking rates. Using this environment, we study reward hacking in open‑weight LLMs and find that such behaviors can be unintentionally learned during supervised fine‑tuning (SFT) when even a small fraction of reward‑hacking trajectories leak into training data. As little as 1% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open‑source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib‑khan5040/Countdown‑Code.

Authors:Sofiane Ouaari, Jules Kreuer, Nico Pfeifer
Title: How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences
Abstract:
DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings‑as‑a‑Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero‑shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT‑2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per‑token embeddings allow near‑perfect sequence reconstruction across all models. For mean‑pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences with reconstruction similarities > 90%, while DNABERT‑2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy‑aware design in genomic foundation models prior to their widespread deployment in EaaS settings. Training code, model weights and evaluation pipeline are released on: https://github.com/not‑a‑feature/DNA‑Embedding‑Inversion.

Authors:Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang
Title: Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration
Abstract:
Cooperative multi‑agent reinforcement learning (MARL) systems powered by large language models (LLMs) are frequently optimized via sparse terminal‑only feedback. This shared signal entangles upstream decisions, obstructing accurate decision‑level credit assignment. To address this trajectory‑level diffusion, we introduce Contextual Counterfactual Credit Assignment (\textttC3). Instead of distributing rewards across an entire episode, \textttC3 isolates the causal impact of individual messages by freezing the exact transcript‑derived context, evaluating context‑matched alternatives via fixed‑continuation replay, and applying a leave‑one‑out (LOO) baseline. This localized intervention extracts unbiased, low‑variance marginal advantages for standard policy‑gradient optimization. Evaluated across five mathematical and coding benchmarks under matched budgets, \textttC3 improves terminal performance over established baselines. Mechanistic diagnostics further show that these gains are accompanied by higher credit fidelity, lower contextual variance, and stronger inter‑agent causal dependence. Our code is available at https://github.com/EIT‑EAST‑Lab/C3.

Authors:Irene Wang, Vishnu Varma Venkata, Arvind Krishnamurthy, Divya Mahajan
Title: NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning
Abstract:
The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology‑agnostic search, handling communication and memory separately. Without per‑device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices, increasing synchronization, inflating communication, and underutilizing compute‑limiting scalability and efficiency on real datacenter networks. We present NEST, a network‑, compute‑, and memory‑aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST's DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co‑location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show NEST achieves up to 2.43 times higher throughput, better memory efficiency, and improved scalability over state‑of‑the‑art baselines, providing a foundation for co‑designing parallelization strategies and datacenter interconnects for next‑generation AI infrastructure. The source code of NEST is available at: https://github.com/scai‑tech/Nest

Authors:Gregor Baer
Title: xaitimesynth: A Python Package for Evaluating Attribution Methods for Time Series with Synthetic Ground Truth
Abstract:
Evaluating time series attribution methods is difficult because real‑world datasets rarely provide ground truth for which time points drive a prediction. A common workaround is to generate synthetic data where class‑discriminating features are placed at known locations, but each study currently reimplements this from scratch. We introduce xaitimesynth, a Python package that provides reusable infrastructure for this evaluation approach. The package generates synthetic time series following an additive model where each sample is a sum of background signal and a localized, class‑discriminating feature, with the feature window automatically tracked as a ground truth mask. A fluent data generation API and YAML configuration format allow flexible and reproducible dataset definitions for both univariate and multivariate time series. The package also provides standard localization metrics, including AUC‑PR, AUC‑ROC, Relevance Mass Accuracy, and Relevance Rank Accuracy. xaitimesynth is open source and available at https://github.com/gregorbaer/xaitimesynth.

Authors:Sayeem Bin Zaman, Fahim Hafiz, Riasat Azim
Title: SpatialMAGIC: A Hybrid Framework Integrating Graph Diffusion and Spatial Attention for Spatial Transcriptomics Imputation
Abstract:
Spatial transcriptomics (ST) enables mapping gene expression with spatial context but is severely affected by high sparsity and technical noise, which conceals true biological signals and hinders downstream analyses. To address these challenges, SpatialMagic was proposed, which is a hybrid imputation model combining MAGIC‑based graph diffusion with transformer‑based spatial self‑attention. The long‑range dependencies in the gene expression are captured by graph diffusion, and local neighborhood structure is captured by spatial attention models, which allow for recovering the missing expression values, retaining spatial consistency. Across multiple platforms, SpatialMagic consistently outperforms existing baselines, including MAGIC and attention‑based models, achieving peak Adjusted Rand Index (ARI) scores in clustering accuracy of 0.3301 on high‑resolution Stereo‑Seq data, 0.3074 on Slide‑Seq, and 0.4216 on the Sci‑Space dataset. Beyond quantitative improvements, SpatialMagic substantially enhances downstream biological analyses by improving the detection of both up‑ and down‑regulated genes while maintaining regulatory consistency across datasets. The pathway enrichment analysis of the recovered genes indicates that they are involved in consistent processes across key metabolic, transport, and neural signaling pathways, suggesting that the framework improves data quality while preserving biological interpretability. Overall, SpatialMagic's hybrid diffusion attention strategy and refinement module outperform state‑of‑the‑art baselines on quantitative metrics and provide a better understanding of the imputed data by preserving tissue architecture and uncovering biologically relevant genes. The source code and datasets are provided in the following link: https://github.com/sayeemzzaman/SpatialMAGIC

Authors:Jiefu Zhang, Yang Xu, Vaneet Aggarwal
Title: Don't Freeze, Don't Crash: Extending the Safe Operating Range of Neural Navigation in Dense Crowds
Abstract:
Navigating safely through dense crowds requires collision avoidance that generalizes beyond the densities seen during training. Learning‑based crowd navigation can break under out‑of‑distribution crowd sizes due to density‑sensitive observation normalization and social‑cost scaling, while analytical solvers often remain safe but freeze in tight interactions. We propose a reinforcement learning approach for dense, variable‑density navigation that attains zero‑shot density generalization using a density‑invariant observation encoding with density‑randomized training and physics‑informed proxemic reward shaping with density‑adaptive scaling. The encoding represents the distance‑sorted K nearest pedestrians plus bounded crowd summaries, keeping input statistics stable as crowd size grows. Trained with N\!\in\![11,16] pedestrians in a 3\mathrmm×3\mathrmm arena and evaluated up to N\!=\!21 pedestrians (1.3× denser), our policy reaches the goal in >99% of episodes and achieves 86% collision‑free success in random crowds, with markedly less freezing than analytical methods and a >\!60‑point collision‑free margin over learning‑based benchmark methods. Codes are available at \hrefhttps://github.com/jznmsl/PSS‑Socialhttps://github.com/jznmsl/PSS‑Social.

Authors:Guanglin Zhou, Armin Catic, Motahare Shabestari, Matthew Young, Chaiquan Li, Katrina Poppe, Sebastiano Barbieri
Title: From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories
Abstract:
Access to electronic health records (EHRs) for digital health research is often limited by privacy regulations and institutional barriers. Synthetic EHRs have been proposed as a way to enable safe and sovereign data sharing; however, existing methods may produce records that capture overall statistical properties of real data but present inconsistencies across clinical processes and observations. We developed an integrated pipeline to make synthetic patient trajectories clinically consistent through two synergistic steps: high‑fidelity generation and scalable auditing. Using the MIMIC‑IV database, we trained a knowledge‑grounded generative model that represents nearly 32,000 distinct clinical events, including demographics, laboratory measurements, medications, procedures, and diagnoses, while enforcing structural integrity. To support clinical consistency at scale, we incorporated an automated auditing module leveraging large language models to filter out clinical inconsistencies (e.g., contraindicated medications) that escape probabilistic generation. We generated 18,071 synthetic patient records derived from a source cohort of 180,712 real patients. While synthetic clinical event probabilities demonstrated robust agreement (mean bias effectively 0.00) and high correlation (R2=0.99) with the real counterparts, review of a random sample of synthetic records (N=20) by three clinicians identified inconsistencies in 45‑60% of them. Automated auditing reduced the difference between real and synthetic data (Cohen's effect size d between 0.59 and 1.60 before auditing, and between 0.18 and 0.67 after auditing). Downstream models trained on audited data matched or even exceeded real‑data performance. We found no evidence of privacy risks, with membership inference performance indistinguishable from random guessing (F1‑score=0.51).

Authors:Kejing Lu, Zhenpeng Pan, Jianbin Qin, Yoshiharu Ishikawa, Chuan Xiao
Title: Approximate Nearest Neighbor Search for Modern AI: A Projection-Augmented Graph Approach
Abstract:
Approximate Nearest Neighbor Search (ANNS) is fundamental to modern AI applications. Most existing solutions optimize query efficiency but fail to align with the practical requirements of modern workloads. In this paper, we outline six critical demands of modern AI applications: high query efficiency, fast indexing, low memory footprint, scalability to high dimensionality, robustness across varying retrieval sizes, and support for online insertions. To satisfy all these demands, we introduce Projection‑Augmented Graph (PAG), a new ANNS framework that integrates projection techniques into a graph index. PAG reduces unnecessary exact distance computations through asymmetric comparisons between exact and approximate distances as guided by projection‑based statistical tests. Three key components are designed and unified to the graph index to optimize indexing and searching. Experiments on six modern datasets demonstrate that PAG consistently achieves superior query per second (QPS)‑recall performance ‑‑ up to 5x faster than HNSW ‑‑ while offering fast indexing speed and moderate memory footprint. PAG remains robust as dimensionality and retrieval size increase and naturally supports online insertions.

Authors:Swamynathan V P
Title: SR-TTT: Surprisal-Aware Residual Test-Time Training
Abstract:
Test‑Time Training (TTT) language models achieve theoretically infinite context windows with an O(1) memory footprint by replacing the standard exact‑attention KV‑cache with hidden state ``fast weights'' W_fast updated via self‑supervised learning during inference. However, pure TTT architectures suffer catastrophic failures on exact‑recall tasks (e.g., Needle‑in‑a‑Haystack). Because the fast weights aggressively compress the context into an information bottleneck, highly surprising or unique tokens are rapidly overwritten and forgotten by subsequent token gradient updates. We introduce SR‑TTT (Surprisal‑Aware Residual Test‑Time Training), which resolves this recall failure by augmenting the TTT backbone with a loss‑gated sparse memory mechanism. By dynamically routing only incompressible, highly surprising tokens to a traditional exact‑attention Residual Cache, SR‑TTT preserves O(1) memory for low‑entropy background context while utilizing exact attention exclusively for critical needles. Our complete implementation, training scripts, and pre‑trained weights are open‑source and available at: https://github.com/swamynathanvp/Surprisal‑Aware‑Residual‑Test‑Time‑Training.

Authors:Qingsong Zou, Zhi Yan, Zhiyao Xu, Kuofeng Gao, Jingyu Xiao, Yong Jiang
Title: SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts
Abstract:
Due to the strong context‑awareness capabilities demonstrated by large language models (LLMs), recent research has begun exploring their integration into smart home assistants to help users manage and adjust their living environments. While LLMs have been shown to effectively understand user needs and provide appropriate responses, most existing studies primarily focus on interpreting and executing user behaviors or instructions. However, a critical function of smart home assistants is the ability to detect when the home environment is in an anomalous state. This involves two key requirements: the LLM must accurately determine whether an anomalous condition is present, and provide either a clear explanation or actionable suggestions. To enhance the anomaly detection capabilities of next‑generation LLM‑based smart home assistants, we introduce SmartBench, which is the first smart home dataset designed for LLMs, containing both normal and anomalous device states as well as normal and anomalous device state transition contexts. We evaluate 13 mainstream LLMs on this benchmark. The experimental results show that most state‑of‑the‑art models cannot achieve good anomaly detection performance. For example, Claude‑Sonnet‑4.5 achieves only 66.1% detection accuracy on context‑independent anomaly categories, and performs even worse on context‑dependent anomalies, with an accuracy of only 57.8%. More experimental results suggest that next‑generation LLM‑based smart home assistants are still far from being able to effectively detect and handle anomalous conditions in the smart home environment. Our dataset is publicly available at https://github.com/horizonsinzqs/SmartBench.

Authors:Rishabh Tiwari, Aditya Tomar, Udbhav Bamba, Monishwaran Maheswaran, Heng Yang, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Title: Reward Under Attack: Analyzing the Robustness and Hackability of Process Reward Models
Abstract:
Process Reward Models (PRMs) are rapidly becoming the backbone of LLM reasoning pipelines, yet we demonstrate that state‑of‑the‑art PRMs are systematically exploitable under adversarial optimization pressure. To address this, we introduce a three‑tiered diagnostic framework that applies increasing adversarial pressure to quantify these vulnerabilities. Static perturbation analysis uncovers a fluency‑logic dissociation: high invariance to surface‑level style changes reward changes <0.1, yet inconsistent detection of logically‑corrupted reasoning, with different models failing on different attack types. Adversarial optimization demonstrates that gradient‑based attacks inflate rewards on invalid trajectories, with reward landscapes exhibiting wide, exploitable peaks. RL‑induced reward hacking exposes the critical failure mode: policies trained on AIME problems achieve near‑perfect PRM rewards (>0.9), while ground‑truth accuracy remains low (below 4%), with 43% of reward gains attributable to stylistic shortcuts. These findings reveal that current PRMs function as fluency detectors rather than reasoning verifiers, creating systematic blind spots that undermine their use as training signals. We release PRM‑BiasBench and a diagnostic toolkit to enable robustness evaluation before deployment. The code and dataset are available at https://github.com/SqueezeAILab/reward‑under‑attack.

Authors:Fali Wang, Chenglin Weng, Xianren Zhang, Siyuan Hong, Hui Liu, Suhang Wang
Title: GraphSkill: Documentation-Guided Hierarchical Retrieval-Augmented Coding for Complex Graph Reasoning
Abstract:
The growing demand for automated graph algorithm reasoning has attracted increasing attention in the large language model (LLM) community. Recent LLM‑based graph reasoning methods typically decouple task descriptions from graph data, generate executable code augmented by retrieval from technical documentation, and refine the code through debugging. However, we identify two key limitations in existing approaches: (i) they treat technical documentation as flat text collections and ignore its hierarchical structure, leading to noisy retrieval that degrades code generation quality; and (ii) their debugging mechanisms focus primarily on runtime errors, yet ignore more critical logical errors. To address them, we propose \method, an agentic hierarchical retrieval‑augmented coding framework that exploits the document hierarchy through top‑down traversal and early pruning, together with a self‑debugging coding agent that iteratively refines code using automatically generated small‑scale test cases. To enable comprehensive evaluation of complex graph reasoning, we introduce a new dataset, \dataset, covering small‑scale, large‑scale, and composite graph reasoning tasks. Extensive experiments demonstrate that our method achieves higher task accuracy and lower inference cost compared to baselines\footnoteThe code is available at \hrefhttps://github.com/FairyFali/GraphSkill\textcolorbluehttps://github.com/FairyFali/GraphSkill..

Authors:Ching-Yun Ko, Pin-Yu Chen
Title: vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM
Abstract:
Modern artificial intelligence (AI) models are deployed on inference engines to optimize runtime efficiency and resource allocation, particularly for transformer‑based large language models (LLMs). The vLLM project is a major open‑source library to support model serving and inference. However, the current implementation of vLLM limits programmability of the internal states of deployed models. This prevents the use of popular test‑time model alignment and enhancement methods. For example, it prevents the detection of adversarial prompts based on attention patterns or the adjustment of model responses based on activation steering. To bridge this critical gap, we present vLLM Hook, an opensource plug‑in to enable the programming of internal states for vLLM models. Based on a configuration file specifying which internal states to capture, vLLM Hook provides seamless integration to vLLM and supports two essential features: passive programming and active programming. For passive programming, vLLM Hook probes the selected internal states for subsequent analysis, while keeping the model generation intact. For active programming, vLLM Hook enables efficient intervention of model generation by altering the selected internal states. In addition to presenting the core functions of vLLM Hook, in version 0, we demonstrate 3 use cases including prompt injection detection, enhanced retrieval‑augmented retrieval (RAG), and activation steering. Finally, we welcome the community's contribution to improve vLLM Hook via https://github.com/ibm/vllm‑hook.

Authors:Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu
Title: SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation
Abstract:
Incremental Few‑Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base‑training scenes. We introduce SCOPE (Scene‑COntextualised Prototype Enrichment), a plug‑and‑play background‑guided prototype enrichment framework that integrates with any prototype‑based 3D segmentation method. After base training, a class‑agnostic segmentation model extracts high‑confidence pseudo‑instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few‑shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel‑class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available https://github.com/Surrey‑UP‑Lab/SCOPE.

Authors:Kartik Sharma, Rakshit S. Trivedi
Title: COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Abstract:
Activation steering methods enable inference‑time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade‑off: sample‑efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD‑Steer, a training‑free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in‑context examples. Our key insight is that the effect of fine‑tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite‑difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD‑Steer achieves upto 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD‑Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context‑aware model control that can flexibly address varying loss‑driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.

Authors:Xiaojie Li, Yu Han, Zhizheng Lu, Shi Jin, Chao-Kai Wen
Title: U6G XL-MIMO Radiomap Prediction: Multi-Config Dataset and Beam Map Approach
Abstract:
The upper 6 GHz (U6G) band with XL‑MIMO is a key enabler for sixth‑generation wireless systems, yet intelligent radiomap prediction for such systems remains challenging. Existing datasets support only small‑scale arrays (up to 8x8) with predominantly isotropic antennas, far from the 1024‑element directional arrays envisioned for 6G. Moreover, current methods encode array configurations as scalar parameters, forcing neural networks to extrapolate array‑specific radiation patterns, which fails when predicting radiomaps for configurations absent from training data. To jointly address data scarcity and generalization limitations, this paper advances XL‑MIMO radiomap prediction from three aspects. To overcome data limitations, we construct the first XL‑MIMO radiomap dataset containing 78400 radiomaps across 800 urban scenes, five frequency bands (1.8‑6.7 GHz), and nine array configurations up to 32x32 uniform planar arrays with directional elements. To enable systematic evaluation, we establish a comprehensive benchmark framework covering practical scenarios from coverage estimation without field measurements to generalization across unseen configurations and environments. To enable generalization to arbitrary beam configurations without retraining, we propose the beam map, a physics‑informed spatial feature that analytically computes array‑specific coverage patterns. By decoupling deterministic array radiation from data learned multipath propagation, beam maps shift generalization from neural network extrapolation to physics‑based computation. Integrating beam maps into existing architectures reduces mean absolute error by up to 60.0% when generalizing to unseen configurations and up to 50.5% when transferring to unseen environments. The complete dataset and code are publicly available at https://lxj321.github.io/MulticonfigRadiomapDataset/.

Authors:Han-Chen Zhang, Zi-Hao Zhou, Mao-Lin Luo, Shimin Di, Min-Ling Zhang, Tong Wei
Title: DC-Merge: Improving Model Merging with Directional Consistency
Abstract:
Model merging aims to integrate multiple task‑adapted models into a unified model that preserves the knowledge of each task. In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between merged multi‑task vector and individual task vectors. However, this consistency is frequently compromised by two issues: i) an imbalanced energy distribution within task vectors, where a small fraction of singular values dominate the total energy, leading to the neglect of semantically important but weaker components upon merging, and ii) the geometric inconsistency of task vectors in parameter space, which causes direct merging to distort their underlying directional geometry. To address these challenges, we propose DC‑Merge, a method for directional‑consistent model merging. It first balances the energy distribution of each task vector by smoothing its singular values, ensuring all knowledge components are adequately represented. These energy‑balanced vectors are then projected onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error. Finally, the aligned vectors are aggregated in the shared orthogonal subspace and projected back to the original parameter space. Extensive experiments on vision and vision‑language benchmarks show that DC‑Merge consistently achieves state‑of‑the‑art performance in both full fine‑tuning and LoRA settings. The implementation code is available at https://github.com/Tobeginwith/DC‑Merge.

Authors:Soumya Mazumdar, Vineet Kumar Rakesh
Title: TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation
Abstract:
Diffusion models have recently advanced photorealistic human synthesis, although practical talking‑head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio‑visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference‑conditioned latent diffusion framework that explores few‑step inference for efficient audio‑driven talking‑head generation. The approach adopts a teacher‑student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame‑to‑frame flicker during synthesis, while viseme‑based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising‑stage component‑level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU‑only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion‑based talking‑head generation under constrained computational settings. GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff

Authors:Habibullah Akbar
Title: Weak-SIGReg: Covariance Regularization for Stable Deep Learning
Abstract:
Modern neural network optimization relies heavily on architectural priorssuch as Batch Normalization and Residual connectionsto stabilize training dynamics. Without these, or in low‑data regimes with aggressive augmentation, low‑bias architectures like Vision Transformers (ViTs) often suffer from optimization collapse. This work adopts Sketched Isotropic Gaussian Regularization (SIGReg), recently introduced in the LeJEPA self‑supervised framework, and repurposes it as a general optimization stabilizer for supervised learning. While the original formulation targets the full characteristic function, a computationally efficient variant is derived, Weak‑SIGReg, which targets the covariance matrix via random sketching. Inspired by interacting particle systems, representation collapse is viewed as stochastic drift; SIGReg constrains the representation density towards an isotropic Gaussian, mitigating this drift. Empirically, SIGReg recovers the training of a ViT on CIFAR‑100 from a collapsed 20.73% to 72.02% accuracy without architectural hacks and significantly improves the convergence of deep vanilla MLPs trained with pure SGD. Code is available at \hrefhttps://github.com/kreasof‑ai/sigreggithub.com/kreasof‑ai/sigreg.

Authors:Xuan Li, Zhanke Zhou, Zongze Li, Jiangchao Yao, Yu Rong, Lu Zhang, Bo Han
Title: Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
Abstract:
Large language models (LLMs) benefit substantially from supervised fine‑tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction‑based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step‑by‑step optimization trajectory. We reveal that answer‑only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model's lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference‑guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy's intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate × Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at https://github.com/tmlr‑group/RePO.

Authors:Xiang Zhang, Sohyun Yoo, Hongrui Wu, Chuan Li, Jianwen Xie, Zhuowen Tu
Title: PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Abstract:
We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post‑hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist‑ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point‑cloud encoder with pixel‑aligned image features and global scene context via cross‑attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high‑fidelity geometry. Experiments on synthetic and real‑world datasets show that PixARMesh achieves state‑of‑the‑art reconstruction quality while producing lightweight, high‑quality meshes ready for downstream applications.

Authors:Mingluo Su, Huan Wang
Title: ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
Abstract:
Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one‑shot pruning is to leverage second‑order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left‑to‑right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre‑pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two‑level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2‑7B/13B/70B, LLaMA3‑8B, Mistral‑7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at https://github.com/mingluo‑su/ROSE.

Authors:Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim
Title: ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
Abstract:
While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt‑response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug and optimization aware reflection, and self‑correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from external‑dependent refinement to an intrinsic, fully autonomous self‑reflection and self‑correction capabilities at inference time. We utilize an RL‑zero training paradigm with granular reward functions to optimize the entire reflection‑correction trajectory, teaching the model how to debug without reliance on ground‑truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder‑8B establishes a new state‑of‑the‑art (SOTA) among leading open‑source models in the 1.5B‑14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single‑attempt setting, rivaling or surpassing proprietary models like GPT‑5.1. Notably, our framework is significantly more token‑efficient than base models, reducing inference‑time compute overhead by approximately 40% through disciplined, high‑speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.

Authors:Son Thai Ly, Hien V. Nguyen
Title: Self-Auditing Parameter-Efficient Fine-Tuning for Few-Shot 3D Medical Image Segmentation
Abstract:
Adapting foundation models to new clinical sites remains challenging in practice. Domain shift and scarce annotations must be handled by experts, yet many clinical groups do not have ready access to skilled AI engineers to tune adapter designs and training recipes. As a result, adaptation cycles can stretch from weeks to months, particularly in few‑shot settings. Existing PEFT methods either require manual adapter configuration or automated searches that are computationally infeasible in few‑shot 3D settings. We propose SEA‑PEFT (SElf‑Auditing Parameter‑Efficient Fine‑Tuning) to automate this process. SEA‑PEFT treats adapter configuration as an online allocation problem solved during fine‑tuning rather than through manual, fixed‑topology choices. SEA‑PEFT uses a search‑audit‑allocate loop that trains active adapters, estimates each adapter's Dice utility by momentarily toggling it off, and then reselects the active set under a parameter budget using a greedy knapsack allocator. Exponential Moving Average and Interquartile Range smoothing, together with a Finite‑State Ranking controller, stabilize the loop and improve reliability in high‑noise few‑shot regimes. On TotalSegmentator and FLARE'22, SEA‑PEFT improves mean Dice by 2.4‑‑2.8 points over the strongest fixed‑topology PEFT baselines across 1/5/10‑shot settings while training <1% of parameters. For reproducibility purposes, we made our code publicly available at https://github.com/tsly123/SEA_PEFT

Authors:Mykola Pinchuk
Title: TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks
Abstract:
Autonomous coding agents can produce strong tabular baselines quickly on Kaggle‑style tasks. Practical value depends on end‑to‑end correctness and reliability under time limits. This paper introduces TML‑Bench, a tabular benchmark for data science agents on Kaggle‑style tasks. This paper evaluates 10 OSS LLMs on four Kaggle competitions and three time budgets (240s, 600s, and 1200s). Each model is run five times per task and budget. A run is successful if it produces a valid submission and a private‑holdout score on hidden labels that are not accessible to the agent. This paper reports median performance, success rates, and run‑to‑run variability. MiniMax‑M2.1 model achieves the best aggregate performance score on all four competitions under the paper's primary aggregation. Average performance improves with larger time budgets. Scaling is noisy for some individual models at the current run count. Code and materials are available at https://github.com/MykolaPinchuk/TML‑bench/tree/master.

Authors:Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, Yan Wang
Title: Making Reconstruction FID Predictive of Diffusion Generation FID
Abstract:
It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each element in the dataset, we retrieve its nearest neighbor (NN) in the latent space and interpolate their latent representations. We then decode the interpolated latent and compute the FID between the decoded samples and the original dataset. Additionally, we refine the claim that rFID correlates poorly with gFID, by showing that rFID correlates with sample quality in the diffusion refinement phase, whereas iFID correlates with sample quality in the diffusion navigation phase. Furthermore, we provide an explanation for why iFID correlates well with gFID, and why reconstruction metrics are negatively correlated with gFID, by connecting to results in the diffusion generalization and hallucination. Empirically, iFID is the first metric to demonstrate a strong correlation with diffusion gFID, achieving Pearson linear and Spearman rank correlations approximately 0.85. The source code is provided in https://github.com/tongdaxu/Making‑rFID‑Predictive‑of‑Diffusion‑gFID.

Authors:Shahriar Noroozizadeh, Xiaobin Shen, Jeremy C. Weiss, George H. Chen
Title: SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis
Abstract:
Estimating heterogeneous treatment effects (HTEs) from right‑censored survival data is critical in high‑stakes applications such as precision medicine and individualized policy‑making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta‑learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE‑Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi‑synthetic datasets that pair real‑world covariates with simulated treatments and outcomes, and (iii) real‑world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi‑synthetic, and real‑world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE‑Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE‑Bench .

Authors:Numan Saeed, Fadillah Adamsyah Maani, Mohammad Yaqub
Title: MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis
Abstract:
Fetal ultrasound AI could transform prenatal care in low‑resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point‑of‑care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off‑diagonal components: matched pair alignment is preserved while the off‑diagonal weight decays into negative values, repelling the student from the teacher's inter‑class confusions and forcing discovery of architecturally native features. Our 11.4M parameter student surpasses the 304M‑parameter FetalCLIP teacher on zero‑shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub‑plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real‑time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at https://github.com/numanai/MobileFetalCLIP.

Authors:Francisco M. Calatrava-Nicolás, Shoko Miyauchi, Vitor Fortes Rey, Paul Lukowicz, Todor Stoyanov, Oscar Martinez Mozos
Title: Embedded Inter-Subject Variability in Adversarial Learning for Inertial Sensor-Based Human Activity Recognition
Abstract:
This paper addresses the problem of Human Activity Recognition (HAR) using data from wearable inertial sensors. An important challenge in HAR is the model's generalization capabilities to new unseen individuals due to inter‑subject variability, i.e., the same activity is performed differently by different individuals. To address this problem, we propose a novel deep adversarial framework that integrates the concept of inter‑subject variability in the adversarial task, thereby encouraging subject‑invariant feature representations and enhancing the classification performance in the HAR problem. Our approach outperforms previous methods in three well‑established HAR datasets using a leave‑one‑subject‑out (LOSO) cross‑validation. Further results indicate that our proposed adversarial task effectively reduces inter‑subject variability among different users in the feature space, and it outperforms adversarial tasks from previous works when integrated into our framework. Code: https://github.com/FranciscoCalatrava/EmbeddedSubjectVariability.git

Authors:Luca Della Libera, Cem Subakan, Mirco Ravanelli
Title: WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
Abstract:
Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single‑stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self‑supervised WavLM representations into a single codebook and optimizing an autoregressive next‑chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm‑web/.

Authors:Hokyun Im, Andrey Kolobov, Jianlong Fu, Youngwoon Lee
Title: Latent Policy Steering through One-Step Flow Policies
Abstract:
Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade‑off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent‑space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high‑fidelity latent policy improvement by backpropagating original‑action‑space Q‑gradients through a differentiable one‑step MeanFlow policy to update a latent‑action‑space actor. By eliminating proxy latent critics, LPS allows an original‑action‑space critic to guide end‑to‑end latent‑space optimization, while the one‑step MeanFlow policy serves as a behavior‑constrained generative prior. This decoupling yields a robust method that works out‑of‑the‑box with minimal tuning. Across OGBench and real‑world robotic tasks, LPS achieves state‑of‑the‑art performance and consistently outperforms behavioral cloning and strong latent steering baselines.

Authors:Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei
Title: SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity
Abstract:
NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning ‑‑ a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder (2N‑2):2N patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the (2N‑2):2N model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any (2N‑2):2N weight block into N‑1 overlapping 2:4‑compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per‑token quantization at near‑zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX‑spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute‑bound workloads, the measured speedup ratio (1.33x) approaches the theoretical upper‑bound N/(N‑1)=4/3 at 6:8 weight sparsity in Qwen2.5‑7B, establishing (2N‑2):2N as a practical path to accuracy‑preserving LLM acceleration. Code available at https://github.com/bcacdwk/vllmbench.

Authors:Alper Yıldırım
Title: The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Abstract:
Mechanistic interpretability typically relies on post‑hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology to observe training dynamics. We study grokking ‑ delayed generalization in Transformers trained on cyclic modular addition (Zp) ‑ investigating if specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data‑dependent attention routing. First, we introduce a fully bounded spherical topology enforcing L2 normalization throughout the residual stream and an unembedding matrix with a fixed temperature scale. This removes magnitude‑based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data‑dependent query‑key routing with a uniform distribution, reducing the attention layer to a Continuous Bag‑of‑Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To evaluate whether this acceleration is a task‑specific geometric alignment rather than a generic optimization stabilizer, we use non‑commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests eliminating the memorization phase depends strongly on aligning architectural priors with the task's intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.

Authors:Yize Wu, Ke Gao, Ling Li, Yanjun Wu
Title: Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaptation
Abstract:
Low‑Rank Adaptation (LoRA) is a widely adopted parameter‑efficient method for fine‑tuning Large Langauge Models. It updates the weight matrix as W=W_0+sBA, where W_0 is the original frozen weight, s is a scaling factor and A,B are trainable low‑rank matrices. Despite its robust empirical effectiveness, the theoretical foundations of LoRA remain insufficiently understood, particularly with respect to feature learning stability. In this paper, we first establish that, LoRA can, in principle, naturally achieve and sustain stable feature learning (i.e., be self‑stabilized) under appropriate hyper‑parameters and initializations of A and B. However, we also uncover a fundamental limitation that the necessary non‑zero initialization of A compromises self‑stability, leading to suboptimal performances. To address this challenge, we propose Stable‑LoRA, a weight‑shrinkage optimization strategy that dynamically enhances stability of LoRA feature learning. By progressively shrinking A during the earliest training steps, Stable‑LoRA is both theoretically and empirically validated to effectively eliminate instability of LoRA feature learning while preserving the benefits of the non‑zero start. Experiments show that Stable‑LoRA consistently outperforms other baselines across diverse models and tasks, with no additional memory usage and only negligible computation overheads. The code is available at https://github.com/Yize‑Wu/Stable‑LoRA.

Authors:Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, YunXiang Gong
Title: FedBCD:Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning
Abstract:
Although Federated Learning has been widely studied in recent years, there are still high overhead expenses in each communication round for large‑scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method for communication efficiency. The proposed method splits model parameters into several blocks, including a shared block and enables uploading a specific parameter block by each client, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large‑scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor 1/N lower than those of existing methods, where N is the number of parameter blocks, and they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state‑of‑the‑art algorithms. The code is available at https://github.com/junkangLiu0/FedBCGD.

Authors:Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You
Title: Axiomatic On-Manifold Shapley via Optimal Generative Flows
Abstract:
Shapley‑based attribution is critical for post‑hoc XAI but suffers from off‑manifold artifacts due to heuristic baselines. While generative methods attempt to address this, they often introduce geometric inefficiency and discretization drift. We propose a formal theory of on‑manifold Aumann‑Shapley attributions driven by optimal generative flows. We prove a representation theorem establishing the gradient line integral as the unique functional satisfying efficiency and geometric axioms, notably reparameterization invariance. To resolve path ambiguity, we select the kinetic‑energy‑minimizing Wasserstein‑2 geodesic transporting a prior to the data distribution. This yields a canonical attribution family that recovers classical Shapley for additive models and admits provable stability bounds against flow approximation errors. By reframing baseline selection as a variational problem, our method experimentally outperforms baselines, achieving strict manifold adherence via vanishing Flow Consistency Error and superior semantic alignment characterized by Structure‑Aware Total Variation. Our code is on https://github.com/cenweizhang/OTFlowSHAP.

Authors:Nilusha Jayawickrama, Henrik Toikka, Risto Ojala
Title: Person Detection and Tracking from an Overhead Crane LiDAR
Abstract:
This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle‑centric LiDAR benchmarks, and limited availability of suitable public training data. Henceforth, we curate a site‑specific overhead LiDAR dataset with 3D human bounding‑box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking‑by‑detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance‑sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. The acquired results contribute in bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real‑time feasibility. Finally, we release our dataset and implementations in GitHub to support further research

Authors:Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
Title: BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
Abstract:
Proximal constraints are fundamental to the stability of the Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low‑probability actions, disproportionately suppressing high‑advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band‑constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f‑divergences into dynamic, probability‑aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed‑form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip‑Higher, while robustly mitigating entropy collapse.

Authors:Yuheng Lei, Zhixuan Liang, Hongyuan Zhang, Ping Luo
Title: VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory
Abstract:
Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single‑step observations or short‑context histories, making them struggle with non‑Markovian tasks that require long‑term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real‑time constraints in robotic systems. By contrast, humans can compress important past experiences into long‑term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non‑Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short‑term working memory, and introduces a Transformer‑based contextual memory compressor that recursively converts out‑of‑window observations into a fixed number of episodic memory tokens. The compressor uses self‑attention over a cache of past summary tokens and cross‑attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short‑term and episode‑wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state‑of‑the‑art baselines including diffusion policies and vision‑language‑action (VLA) models by more than 20% on the memory‑intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.

Authors:Yiang Wu, Qiong Wu, Pingyi Fan, Kezhi Wang, Wen Chen, Guoqiang Mao, Khaled B. Letaief
Title: U-Parking: Distributed UWB-Assisted Autonomous Parking System with Robust Localization and Intelligent Planning
Abstract:
This demonstration presents U‑Parking, a distributed Ultra‑Wideband (UWB)‑assisted autonomous parking system. By integrating Large Language Models (LLMs)‑assisted planning with robust fusion localization and trajectory tracking, it enables reliable automated parking in challenging indoor environments, as validated through real‑vehicle demonstrations.

Authors:Minjune Hwang, Yigit Korkmaz, Daniel Seita, Erdem Bıyık
Title: Causally Robust Reward Learning from Reason-Augmented Preference Feedback
Abstract:
Preference‑based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co‑occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de‑emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language‑model fine‑tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks. We have released our code at https://github.com/mj‑hwang/ReCouPLe

Authors:Rui Zhao, Bin Shi, Kai Sun, Bo Dong
Title: Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
Abstract:
Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real‑world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance‑dependent PLL (ID‑PLL), a setting that more accurately reflects this relationship. A significant challenge in ID‑PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class‑specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra‑ and inter‑class regulations. For intra‑class regulation, CAD amplifies class‑specific features to generate class‑wise augmentations and aligns same‑class augmentations across instances. For inter‑class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter‑class distances. By jointly applying intra‑ and inter‑class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID‑PLL performance. The code is available at https://github.com/RyanZhaoIc/CAD.git.

Authors:Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang
Title: Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Abstract:
The limited understanding capacity of the visual encoder in Contrastive Language‑Image Pre‑training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D‑Ability), which reflects class separability, and Detail Perceptual Ability (P‑Ability), which focuses on fine‑grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D‑Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion‑based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D‑Ability and P‑Ability. Extensive experiments across various benchmarks and multi‑modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.

Authors:Shaocheng Lan, Shuqi Gu, Zhangzhi Xiong, Kan Ren
Title: ConTSG-Bench: A Unified Benchmark for Conditional Time Series Generation
Abstract:
Conditional time series generation plays a critical role in addressing data scarcity and enabling causal analysis in real‑world applications. Despite its increasing importance, the field lacks a standardized and systematic benchmarking framework for evaluating generative models across diverse conditions. To address this gap, we introduce the Conditional Time Series Generation Benchmark (ConTSG‑Bench). ConTSG‑Bench comprises a large‑scale, well‑aligned dataset spanning diverse conditioning modalities and levels of semantic abstraction, first enabling systematic evaluation of representative generation methods across these dimensions with a comprehensive suite of metrics for generation fidelity and condition adherence. Both the quantitative benchmarking and in‑depth analyses of conditional generation behaviors have revealed the traits and limitations of the current approaches, highlighting critical challenges and promising research directions, particularly with respect to precise structural controllability and downstream task utility under complex conditions.

Authors:Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang
Title: Interactive Benchmarks
Abstract:
Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating model's ability to acquire information actively is important to assess model's intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long‑horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room to improve in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench

Authors:Jihoon Jeong
Title: Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models
Abstract:
Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models ‑‑ like biological organisms ‑‑ have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions ‑‑ Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora‑12 program, explaining how model behavior emerges from Core‑‑Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open‑source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five‑layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M‑CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis ‑‑ a biologically‑inspired three‑layer parameter architecture ‑‑ and a therapeutic framework connecting diagnosis to treatment.

Authors:Ancymol Thomas, Jaya Sreevalsan-Nair
Title: Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data
Abstract:
Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi‑class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self‑ and cross‑attention mechanisms (FM2), (iii) with the multi‑scale Gaussian filtered images (FM3), and (iv) weighted decision‑level fusion (FM4). Ablation experiments are conducted to study the pixel‑, feature‑, and decision‑level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6%. Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. Our code and processed datasets are available at https://github.com/GVCL/LCZC‑MultiModalHybridFusion

Authors:Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, David Held
Title: Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
Abstract:
We introduce Latent Particle World Model (LPWM), a self‑supervised object‑centric world model scaled to real‑world multi‑object datasets and applicable in decision‑making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end‑to‑end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state‑of‑the‑art results on diverse real‑world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision‑making, including goal‑conditioned imitation learning, as we demonstrate in the paper. Code, data, pre‑trained models and video rollouts are available: https://taldatech.github.io/lpwm‑web

Authors:Faisal Bin Ashraf, Animesh Ray, Stefano Lonardi
Title: AbAffinity: A Large Language Model for Predicting Antibody Binding Affinity against SARS-CoV-2
Abstract:
Machine learning‑based antibody design is emerging as one of the most promising approaches to combat infectious diseases, due to significant advancements in the field of artificial intelligence and an exponential surge in experimental antibody data (in particular related to COVID‑19). The ability of an antibody to bind to an antigens (called binding affinity) is one of the the most critical properties in designing neutralizing antibodies. In this study we introduce Ab‑Affinity, a new large language model that can accurately predict the binding affinity of antibodies against a target peptide, e.g., the SARS‑CoV‑2 spike protein. Code and model are available at https://github.com/ucrbioinfo/AbAffinity.

Authors:Yakov Pyotr Shkolnikov
Title: Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices
Abstract:
Multi‑agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10‑agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re‑prefill through the model ‑‑ 15.7 seconds per agent at 4K context. We address this by persisting each agent's KV cache to disk in 4‑bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per‑agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross‑phase context injection that accumulates attention state across conversation phases without re‑computation. Evaluated on three architectures (Gemma 3 12B, dense GQA, 48 layers; DeepSeek‑Coder‑V2‑Lite 16B, MoE MLA, 27 layers; Llama 3.1 8B, dense GQA, 32 layers), cache restoration reduces time‑to‑first‑token by up to 136x (Gemma: 22‑‑136x at 4K‑‑32K; DeepSeek: 11‑‑76x at 4K‑‑32K; Llama: 24‑‑111x at 4K‑‑16K; 3‑‑10x at 1K). Q4 quantization fits 4x more agent contexts into fixed device memory than FP16. Perplexity measured with actual Q4 KV caches shows ‑0.7% for Gemma, +2.8% for Llama, and +3.0% for DeepSeek. Open‑source at https://github.com/yshk‑mxim/agent‑memory

Authors:Murad Farzulla
Title: Context-Dependent Affordance Computation in Vision-Language Models
Abstract:
We characterize the phenomenon of context‑dependent affordance computation in vision‑language models (VLMs). Through a large‑scale computational study (n=3,213 scene‑context pairs from COCO‑2017) using Qwen‑VL 30B and LLaVA‑1.5‑13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context‑dependent. Sentence‑level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context‑dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within‑prime variance is substantially lower than cross‑prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child‑mobility contrasts. These findings establish that VLMs compute affordances in a substantially context‑dependent manner ‑‑ with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts ‑‑ and suggest a direction for robotics research: dynamic, query‑dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.

Authors:Ekansh Arora
Title: Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology
Abstract:
Foundation models are increasingly applied to computational pathology, yet their behavior under cross‑cancer and cross‑species transfer remains unspecified. This study investigated how fine‑tuning CPath‑CLIP affects cancer detection under same‑cancer, cross‑cancer, and cross‑species conditions using whole‑slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few‑shot fine‑tuning improved same‑cancer (64.9% to 72.6% AUC) and cross‑cancer performance (56.84% to 66.31% AUC). Cross‑species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state‑of‑the‑art benchmarks (H‑optimus‑0: 84.97% AUC), indicating that standard vision‑language alignment is suboptimal for cross‑species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad‑CAM shows prototype‑based models remain domain‑locked, while language‑guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text‑alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H‑optimus‑0 shows that CPath‑CLIP's failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. Additional gains were observed in same‑cancer (8.52%) and cross‑cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species‑dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re‑interpretation without retraining.

Authors:Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski
Title: ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
Abstract:
Feed‑forward transformer models have driven rapid progress in 3D vision, but state‑of‑the‑art methods such as VGGT and π^3 have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential‑reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed‑forward model that achieves linear‑time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic‑time methods. ZipMap employs test‑time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than 20× faster than state‑of‑the‑art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real‑time scene‑state querying and its extension to sequential streaming reconstruction.

Authors:Zachary Novack, Zack Zukowski, CJ Carr, Julian Parker, Zach Evans, Josiah Taylor, Taylor Berg-Kirkpatrick, Julian McAuley, Jordi Pons
Title: Low-Resource Guidance for Controllable Latent Audio Diffusion
Abstract:
Generative audio requires fine‑grained controllable outputs, yet most existing methods require model retraining on specific controls or inference‑time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance‑based controls, in particular their high cost‑per‑step due to decoder backpropagation, we introduce a guidance‑based approach through selective TFG and Latent‑Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step, and requiring minimal training resources (7M parameters and \approx 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and a combination of those) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end‑to‑end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.

Authors:Kelly L Vomo-Donfack, Adryel Hoszu, Grégory Ginot, Ian Morilla
Title: PTOPOFL: Privacy-Preserving Personalised Federated Learning via Persistent Homology
Abstract:
Federated learning (FL) faces two structural tensions: gradient sharing enables data‑reconstruction attacks, while non‑IID client distributions degrade aggregation quality. We introduce PTOPOFL, a framework that addresses both challenges simultaneously by replacing gradient communication with topological descriptors derived from persistent homology (PH). Clients transmit only 48‑dimensional PH feature vectors‑compact shape summaries whose many‑to‑one structure makes inversion provably ill‑posed‑rather than model gradients. The server performs topology‑guided personalised aggregation: clients are clustered by Wasserstein similarity between their PH diagrams, intra‑cluster models are topology‑weighted,and clusters are blended with a global consensus. We prove an information‑contraction theorem showing that PH descriptors leak strictly less mutual information per sample than gradients under strongly convex loss functions, and we establish linear convergence of the Wasserstein‑weighted aggregation scheme with an error floor strictly smaller than FedAvg. Evaluated against FedAvg, FedProx, SCAFFOLD, and pFedMe on a non‑IID healthcare scenario (8 hospitals, 2 adversarial) and a pathological benchmark (10 clients), PTOPOFL achieves AUC 0.841 and 0.910 respectively‑the highest in both settings‑while reducing reconstruction risk by a factor of 4.5 relative to gradient sharing. Code is publicly available at https://github.com/MorillaLab/TopoFederatedL and data at https://doi.org/10.5281/zenodo.18827595.

Authors:Pranav Kumar Kaliaperumal
Title: Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs
Abstract:
Post‑training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers, as originally analyzed by Bondarenko et al. (EMNLP 2021) in work associated with Qualcomm AI Research. This paper provides a reproducible empirical reproduction and systems‑level extension of that phenomenon in BERT‑base fine‑tuned on QNLI. When global W8A8 quantization is applied, validation accuracy drops sharply from 89.66% (FP32) to 54.33%, a decrease of 35.33 points. Statistical analysis of FP32 activations shows strongly heavy‑tailed behavior that intensifies with model depth: kurtosis reaches 271 in the final layers and approximately 55% of activation energy is concentrated in the top 1% of channels. We evaluate several mitigation strategies. Mixed precision PTQ restores accuracy close to the FP32 baseline (89.42%). Per‑embedding‑group (PEG) quantization shows strong sensitivity to grouping structure, improving accuracy from 66.12% with three groups to 86.18% with four groups. In contrast, percentile‑based calibration, even at thresholds between 99.0 and 99.99, fails to recover accuracy (about 50.54%), indicating that large activation channels encode structured signal rather than rare noise. Deployment profiling on an RTX 3050 GPU shows minimal differences in latency and memory usage across methods (median latency about 58‑59 ms; VRAM usage about 484‑486 MB), highlighting the importance of hardware‑aware evaluation. Overall, the results show that PTQ failure in transformers is primarily driven by structured channel dominance amplified through residual connections. Effective mitigation therefore requires channel‑aware precision allocation rather than scalar clipping alone.

Authors:Ioannis Prokopiou, Ioannis Sina, Agisilaos Kounelis, Pantelis Vikatos, Themos Stafylakis
Title: LabelBuddy: An Open Source Music and Audio Language Annotation Tagging Tool Using AI Assistance
Abstract:
The advancement of Machine learning (ML), Large Audio Language Models (LALMs), and autonomous AI agents in Music Information Retrieval (MIR) necessitates a shift from static tagging to rich, human‑aligned representation learning. However, the scarcity of open‑source infrastructure capable of capturing the subjective nuances of audio annotation remains a critical bottleneck. This paper introduces LabelBuddy, an open‑source collaborative auto‑tagging audio annotation tool designed to bridge the gap between human intent and machine understanding. Unlike static tools, it decouples the interface from inference via containerized backends, allowing users to plug in custom models for AI‑assisted pre‑annotation. We describe the system architecture, which supports multi‑user consensus, containerized model isolation, and a roadmap for extending agents and LALMs. Code available at https://github.com/GiannisProkopiou/gsoc2022‑Label‑buddy.

Authors:Han Xiao
Title: mlx-vis: GPU-Accelerated Dimensionality Reduction and Visualization on Apple Silicon
Abstract:
mlx‑vis is a Python library that implements six dimensionality reduction methods and a k‑nearest neighbor graph algorithm entirely in MLX, Apple's array framework for Apple Silicon. The library provides UMAP, t‑SNE, PaCMAP, TriMap, DREAMS, CNE, and NNDescent, all executing on Metal GPU through a unified fit_transform interface. Beyond embedding computation, mlx‑vis includes a GPU‑accelerated circle‑splatting renderer that produces scatter plots and smooth animations without matplotlib, composing frames via scatter‑add alpha blending on GPU and piping them to hardware H.264 encoding. On Fashion‑MNIST with 70,000 points, all methods complete embedding in 2.1‑3.8 seconds and render 800‑frame animations in 1.4 seconds on an M3 Ultra, with the full pipeline from raw data to rendered video finishing in 3.6‑5.2 seconds. The library depends only on MLX and NumPy, is released under the Apache 2.0 license, and is available at https://github.com/hanxiao/mlx‑vis.

Authors:Yang Li, Youyang Sha, Yinzhi Wang, Timothy Hospedales, Xi Shen, Shell Xu Hu, Xuanlong Yu
Title: From Misclassifications to Outliers: Joint Reliability Assessment in Classification
Abstract:
Building reliable classifiers is a fundamental challenge for deploying machine learning in real‑world applications. A reliable system should not only detect out‑of‑distribution (OOD) inputs but also anticipate in‑distribution (ID) errors by assigning low confidence to potentially misclassified samples. Yet, most prior work treats OOD detection and failure prediction as separated problems, overlooking their closed connection. We argue that reliability requires evaluating them jointly. To this end, we propose a unified evaluation framework that integrates OOD detection and failure prediction, quantified by our new metrics DS‑F1 and DS‑AURC, where DS denotes double scoring functions. Experiments on the OpenOOD benchmark show that double scoring functions yield classifiers that are substantially more reliable than traditional single scoring approaches. Our analysis further reveals that OOD‑based approaches provide notable gains under simple or far‑OOD shifts, but only marginal benefits under more challenging near‑OOD conditions. Beyond evaluation, we extend the reliable classifier SURE and introduce SURE+, a new approach that significantly improves reliability across diverse scenarios. Together, our framework, metrics, and method establish a new benchmark for trustworthy classification and offer practical guidance for deploying robust models in real‑world settings. The source code is publicly available at https://github.com/Intellindust‑AI‑Lab/SUREPlus.

Authors:Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, Yuke Zhu
Title: Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning
Abstract:
Continual learning is a long‑standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large‑scale pretrained Vision‑Language‑Action (VLA) models remains underexplored. In this work, we found that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we found that VLAs can retain relevant knowledge from prior tasks despite performance degradation during learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large‑scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay. Code and more information can be found at https://ut‑austin‑rpl.github.io/continual‑vla

Authors:Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang
Title: Relational In-Context Learning via Synthetic Pre-training with Structural Prior
Abstract:
Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high‑quality RDBs are private, scarce and structurally heterogeneous, making internet‑scale pre‑training infeasible. To overcome this data scarcity, We introduce RDB‑PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior‑Data Fitted Networks (PFNs) where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre‑training on over 2 million synthetic single‑table and relational tasks, RDB‑PFN learns to adapt to any new database instantly via genuine in‑context learning. Experiments verify RDB‑PFN achieves strong few‑shot performance on 19 real‑world relational prediction tasks, outperforming graph‑based and single‑table foundation‑model baselines (given the same DFS‑linearized inputs), while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN

Authors:Taejun Lim, Joong-Won Hwang, Kibok Lee
Title: When and Where to Reset Matters for Long-Term Test-Time Adaptation
Abstract:
When continual test‑time adaptation (TTA) persists over the long term, errors accumulate in the model and further cause it to predict only a few classes for all inputs, a phenomenon known as model collapse. Recent studies have explored reset strategies that completely erase these accumulated errors. However, their periodic resets lead to suboptimal adaptation, as they occur independently of the actual risk of collapse. Moreover, their full resets cause catastrophic loss of knowledge acquired over time, even though such knowledge could be beneficial in the future. To this end, we propose (1) an Adaptive and Selective Reset (ASR) scheme that dynamically determines when and where to reset, (2) an importance‑aware regularizer to recover essential knowledge lost due to reset, and (3) an on‑the‑fly adaptation adjustment scheme to enhance adaptability under challenging domain shifts. Extensive experiments across long‑term TTA benchmarks demonstrate the effectiveness of our approach, particularly under challenging conditions. Our code is available at https://github.com/YonseiML/asr.

Authors:Qi Zhang, Harsh Parikh, Ashley Naimi, Razieh Nabi, Christopher Kim, Timothy Lash
Title: Controllable Generative Sandbox for Causal Inference
Abstract:
Method validation and study design in causal inference rely on synthetic data with known counterfactuals. Existing simulators trade off distributional realism, the ability to capture mixed‑type and multimodal tabular data, against causal controllability, including explicit control over overlap, unmeasured confounding, and treatment effect heterogeneity. We introduce CausalMix, a variational generative framework that closes this gap by coupling a mixture of Gaussian latent priors with data‑type‑specific decoders for continuous, binary, and categorical variables. The model incorporates explicit causal controls: an overlap regularizer shaping propensity‑score distributions, alongside direct parameterizations of confounding strength and effect heterogeneity. This unified objective preserves fidelity to the observed data while enabling factorial manipulation of causal mechanisms, allowing overlap, confounding strength, and treatment effect heterogeneity to be varied independently at design time. Across benchmarks, CausalMix achieves state‑of‑the‑art distributional metrics on mixed‑type tables while providing stable, fine‑grained causal control. We demonstrate practical utility in a comparative safety study of metastatic castration‑resistant prostate cancer treatments, using CausalMix to compare estimators under calibrated data‑generating processes, tune hyperparameters, and conduct simulation‑based power analyses under targeted treatment effect heterogeneity scenarios.

Authors:Achleshwar Luthra, Yash Salunkhe, Tomer Galanti
Title: Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning
Abstract:
Frozen self‑supervised representations often transfer well with only a few labels across many semantic tasks. We argue that a single geometric quantity, \emphdirectional CDNV (decision‑axis variance), sits at the core of two favorable behaviors: strong few‑shot transfer within a task, and low interference across many tasks. We show that both emerge when variability \emphalong class‑separating directions is small. First, we prove sharp non‑asymptotic multiclass generalization bounds for downstream classification whose leading term is the directional CDNV. The bounds include finite‑shot corrections that cleanly separate intrinsic decision‑axis variability from centroid‑estimation error. Second, we link decision‑axis collapse to multitask geometry: for independent balanced labelings, small directional CDNV across tasks forces the corresponding decision axes to be nearly orthogonal, helping a single representation support many tasks with minimal interference. Empirically, across SSL objectives, directional CDNV collapses during pretraining even when classical CDNV remains large, and our bounds closely track few‑shot error at practical shot sizes. Additionally, on synthetic multitask data, we verify that SSL learns representations whose induced decision axes are nearly orthogonal. The code and project page of the paper are available at [\hrefhttps://dlfundamentals.github.io/directional‑neural‑collapse/project page].

Authors:Jiahao Qin
Title: mlx-snn: Spiking Neural Networks on Apple Silicon via MLX
Abstract:
We introduce mlx‑snn, the first spiking neural network (SNN) library built natively on Apple's MLX framework. As SNN research grows rapidly, all major libraries ‑‑ snnTorch, Norse, SpikingJelly, Lava ‑‑ target PyTorch or custom backends, leaving Apple Silicon users without a native option. mlx‑snn provides six neuron models (LIF, IF, Izhikevich, Adaptive LIF, Synaptic, Alpha), four surrogate gradient functions, four spike encoding methods (including an EEG‑specific encoder), and a complete backpropagation‑through‑time training pipeline. The library leverages MLX's unified memory architecture, lazy evaluation, and composable function transforms (mx.grad, mx.compile) to enable efficient SNN research on Apple Silicon hardware. We validate mlx‑snn on MNIST digit classification across five hyperparameter configurations and three backends, achieving up to 97.28% accuracy with 2.0‑‑2.5 times faster training and 3‑‑10 times lower GPU memory than snnTorch on the same M3 Max hardware. mlx‑snn is open‑source under the MIT license and available on PyPI. https://github.com/D‑ST‑Sword/mlx‑snn

Authors:Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian
Title: Beyond Pixel Histories: World Models with Persistent 3D State
Abstract:
Interactive world models continually generate video by responding to a user's actions, enabling open‑ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to down‑stream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long‑horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine‑grained, geometry‑aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io

Authors:Peter Adema, Karim Galliamov, Aleksey Evstratovskiy, Ross Geurts
Title: [Re] FairDICE: A Gap Between Theory And Practice
Abstract:
Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi‑objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to e.g.\ incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high‑dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

Authors:Krishna Sri Ipsit Mantri, Carola-Bibiane Schönlieb, Zorah Lähner, Moshe Eliasof
Title: Towards Improved Sentence Representations using Token Graphs
Abstract:
Obtaining a single‑vector representation from a Large Language Model's (LLM) token‑level outputs is a critical step for nearly all sentence‑level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set, discarding the rich relational structure captured by the model's self‑attention layers and making them susceptible to signal dilution. To address this, we introduce GLOT, a lightweight, structure‑aware pooling module that reframes pooling as relational learning followed by aggregation. Operating on the outputs of a frozen LLM, GLOT first constructs a latent token‑similarity graph, then refines token representations with a graph neural network, and finally aggregates them using a readout layer. Experimentally, our approach is remarkably robust and efficient: on a diagnostic stress test where 90% of tokens are random distractors, GLOT maintains over 97% accuracy while baseline methods collapse. Furthermore, it is competitive with state‑of‑the‑art techniques on benchmarks like GLUE and MTEB with 20x fewer trainable parameters and speeds up the training time by over 100x compared with parameter‑efficient fine‑tuning methods. Supported by a theoretical analysis of its expressive power, our work shows that learning over token graphs is a powerful paradigm for the efficient adaptation of frozen LLMs. Our code is published at https://github.com/ipsitmantri/GLOT.

Authors:Ashwath Vaithinathan Aravindan, Mayank Kejriwal
Title: Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
Abstract:
Chain‑of‑Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T\footnoteAssumed parameter count of closed models), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50‑60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20‑30% loss even for largest models); ExtraSteps incur minimal accuracy degradation (0‑6%) regardless of scale; Sycophancy produces modest effects (7% loss for small models); and SkippedSteps cause intermediate damage (15% loss). Scaling relationships follow power‑law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi‑stage reasoning pipelines and underscore the necessity of task‑specific robustness assessments and mitigation strategies. The code and results are available https://github.com/Mystic‑Slice/CoTPerturbation.

Authors:Xin Yang, Letian Li, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xunliang Cai, Wenyuan Jiang
Title: Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO
Abstract:
Large language models (LLMs) have demonstrated remarkable and steadily improving performance across a wide range of tasks. However, LLM performance may be highly sensitive to prompt variations especially in scenarios with limited openness or strict output formatting requirements, indicating insufficient robustness. In real‑world applications, user prompts provided to LLMs often contain imperfections, which may undermine the quality of the model's responses. To address this issue, previous work has primarily focused on preprocessing prompts, employing external tools or even LLMs to refine prompt formulations in advance. However, these approaches overlook the intrinsic robustness of LLMs, and their reliance on external components introduces additional computational overhead and uncertainty. In this work, we propose a Contrastive Learning‑based Inverse Direct Preference Optimization (CoIPO) method that minimizes the discrepancy between the label‑aligned logits produced by the model under a clean prompt and its noisy counterpart, and conduct a detailed analysis using mutual information theory. We augment the FLAN dataset by constructing paired prompts, each consisting of a clean prompt and its corresponding noisy version for training. Additionally, to evaluate the effectiveness, we develop NoisyPromptBench, a benchmark enhanced and derived from the existing PromptBench. Experimental results conducted on NoisyPromptBench demonstrate that our proposed method achieves a significant improvement in average accuracy over the current state‑of‑the‑art approaches. The source code of CoIPO, pair‑wise FLAN datasets, and NoisyPromptBench have already been released on https://github.com/vegetable‑yx/CoIPO.

Authors:Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Jingjing Wang, Xuanzhao Dong, Minzhou Huang, Rui Cai, Hejian Sang, Hao Wang, Peijie Qiu, Yueyue Deng, Prayag Tiwari, Brendan Hogan Rappazzo, Yalin Wang
Title: AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents
Abstract:
Long‑horizon LLM agents require memory systems that remain accurate under fixed context budgets. However, existing systems struggle with two persistent challenges in long‑term dialogue: (i) disconnected evidence, where multi‑hop answers require linking facts distributed across time, and (ii) state updates, where evolving information (e.g., schedule changes) creates conflicts with older static logs. We propose AriadneMem, a structured memory system that addresses these failure modes via a decoupled two‑phase pipeline. In the offline construction phase, AriadneMem employs \emphentropy‑aware gating to filter noise and low‑information message before LLM extraction and applies \emphconflict‑aware coarsening to merge static duplicates while preserving state transitions as temporal edges. In the online reasoning phase, rather than relying on expensive iterative planning, AriadneMem executes \emphalgorithmic bridge discovery to reconstruct missing logical paths between retrieved facts, followed by \emphsingle‑call topology‑aware synthesis. On LoCoMo experiments with GPT‑4o, AriadneMem improves Multi‑Hop F1 by 15.2% and Average F1 by 9.0% over strong baselines. Crucially, by offloading reasoning to the graph layer, AriadneMem reduces total runtime by 77.8% using only 497 context tokens. The code is available at https://github.com/LLM‑VLM‑GSL/AriadneMem.

Authors:Hanyang Wang, Yiyang Liu, Jiawei Chi, Fangfu Liu, Ran Xue, Yueqi Duan
Title: CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
Abstract:
Classifier‑Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow‑based diffusion models. In this paper, we explore a unified framework called CFG‑Ctrl, which reinterprets CFG as a control applied to the first‑order continuous‑time generative flow, using the conditional‑unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P‑control) with fixed gain, and typical follow‑up variants develop extended control‑law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC‑CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback‑guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite‑time convergence. Experiments across text‑to‑image generation models including Stable Diffusion 3.5, Flux, and Qwen‑Image demonstrate that SMC‑CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: https://hanyang‑21.github.io/CFG‑Ctrl

Authors:Toru Lin, Shuying Deng, Zhao-Heng Yin, Pieter Abbeel, Jitendra Malik
Title: How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
Abstract:
Many essential manipulation tasks ‑ such as food preparation, surgery, and craftsmanship ‑ remain intractable for autonomous robots. These tasks are characterized not only by contact‑rich, force‑sensitive dynamics, but also by their "implicit" success criteria: unlike pick‑and‑place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two‑stage pipeline: first, we learn a robust initial policy via force‑aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference‑based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50‑200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference‑based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero‑shot generalization to unseen in‑category instances and to out‑of‑distribution produce from different categories while maintaining over 90% success rates.

Authors:Jessie Z. Li, Zhiqing Hong, Toru Shirakawa, Serina Chang
Title: Learning Demographic-Conditioned Mobility Trajectories with Aggregate Supervision
Abstract:
Human mobility trajectories are widely studied in public health and social science, where different demographic groups exhibit significantly different mobility patterns. However, existing trajectory generation models rarely capture this heterogeneity because most trajectory datasets lack demographic labels. To address this gap in data, we propose ATLAS, a weakly supervised approach for demographic‑conditioned trajectory generation using only (i) individual trajectories without demographic labels, (ii) region‑level aggregated mobility features, and (iii) region‑level demographic compositions from census data. ATLAS trains a trajectory generator and fine‑tunes it so that simulated mobility matches observed regional aggregates while conditioning on demographics. Experiments on real trajectory data with demographic labels show that ATLAS substantially improves demographic realism over baselines (JSD \downarrow 12%‑‑69%) and closes much of the gap to strongly supervised training. We further develop theoretical analyses for when and why ATLAS works, identifying key factors including demographic diversity across regions and the informativeness of the aggregate feature, paired with experiments demonstrating the practical implications of our theory. We release our code at https://github.com/schang‑lab/ATLAS.

Authors:Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao
Title: Step-Level Sparse Autoencoder for Reasoning Process Interpretation
Abstract:
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain‑of‑Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step‑level information, such as reasoning direction and semantic transitions. In this work, we propose step‑level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface‑level information, such as generation length and first token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations indicate that LLMs should already at least partly know about these properties during generation, which provides the foundation for the self‑verification ability of LLMs. The code is available at https://github.com/Miaow‑Lab/SSAE

Authors:Huanlei Guo, Hongxin Wei, Bingyi Jing
Title: Toward Early Quality Assessment of Text-to-Image Diffusion Models
Abstract:
Recent text‑to‑image (T2I) diffusion and flow‑matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a ``generate‑‑then‑‑select'' mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource‑intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post‑hoc. In this work, we address this inefficiency by introducing Probe‑Select, a plug‑in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure, object layout and spatial arrangement‑‑that strongly correlates with final image fidelity. Probe‑Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow‑matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.

Authors:Erik Hartman, Di Tang, Johan Malmström
Title: Deep learning-guided evolutionary optimization for protein design
Abstract:
Designing novel proteins with desired characteristics remains a significant challenge due to the large sequence space and the complexity of sequence‑function relationships. Efficient exploration of this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates based on prior evaluations and surrogate model predictions, enabling data‑efficient optimization. We demonstrate the utility of BoGA through benchmarking on sequence and structure design tasks, followed by its application in designing peptide binders against pneumolysin, a key virulence factor of Streptococcus pneumoniae. BoGA accelerates the discovery of high‑confidence binders, demonstrating the potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available under an MIT license at \hrefhttps://github.com/ErikHartman/bopepGitHub.

Authors:Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng
Title: SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety
Abstract:
Vision‑language models remain susceptible to multimodal jailbreaks and over‑refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR‑ToolKit, which formalizes safety decision‑making as a checkable protocol. Concretely, a planner specifies a persona, a Perception \to Reasoning \to Decision tool set, and a constrained transition graph, while a responder outputs a typed key‑value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three‑stage curriculum (SFT \to DPO \to GRPO), where GRPO directly supervises tool usage beyond answer‑level feedback. Our contributions are two‑fold: I. Dataset. The first tool‑based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held‑out evaluation. II. Experiments. On Qwen2.5‑VL, SaFeR‑ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 \to 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 \to 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 \to 59.21; 7B: 66.39 \to 66.81). Codes are available at https://github.com/Duebassx/SaFeR_ToolKit.

Authors:Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye
Title: Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
Abstract:
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self‑improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language‑based self‑improving approaches to vision language models (VLMs) presents a unique challenge:~visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self‑Taught Reasoner (VC‑STaR), a novel self‑improving framework that leverages visual contrast to mitigate hallucinations in model‑generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi‑modal similarity, and generate rationales using VC‑STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR‑55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC‑STaR not only outperforms existing self‑improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC‑STaR.

Authors:Xuejin Luo, Shiquan Sun, Runshi Zhang, Ruizhi Zhang, Junchen Wang
Title: Give me scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery
Abstract:
During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision‑free dual‑arm surgical assistive robot capable of performing instrument delivery. A vision‑language model is utilized to automatically generate the robot's grasping and delivery trajectories in a zero‑shot manner based on surgeons' instructions. A real‑time obstacle minimum distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self‑collision prevention during the dual‑arm robot's autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision‑free movement throughout all trials. The project page and source code are available at https://give‑me‑scissors.github.io/.

Authors:Liu Yang, Zeyu Nie, Andrew Liu, Felix Zou, Deniz Altinbüken, Amir Yazdanbakhsh, Quanquan C. Liu
Title: ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution
Abstract:
The transition from sequential to parallel computing is essential for modern high‑performance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non‑uniform meshes) where static scheduling fails and data dependencies are unpredictable. Current Large Language Models (LLMs) often fail catastrophically on these tasks, generating code plagued by subtle race conditions, deadlocks, and sub‑optimal scaling. We bridge this gap with ParEVO, a framework designed to synthesize high‑performance parallel algorithms for irregular data. Our contributions include: (1) The Parlay‑Instruct Corpus, a curated dataset of 13,820 tasks synthesized via a "Critic‑Refine" pipeline that explicitly filters for empirically performant algorithms that effectively utilize Work‑Span parallel primitives; (2) specialized DeepSeek, Qwen, and Gemini models fine‑tuned to align probabilistic generation with the rigorous semantics of the ParlayLib library; and (3) an Evolutionary Coding Agent (ECA) that improves the "last mile" of correctness by iteratively repairing code using feedback from compilers, dynamic race detectors, and performance profilers. On the ParEval benchmark, ParEVO achieves an average 106x speedup (with a maximum of 1103x) across the suite, and a robust 13.6x speedup specifically on complex irregular graph problems, outperforming state‑of‑the‑art commercial models. Furthermore, our evolutionary approach matches state‑of‑the‑art expert human baselines, achieving up to a 4.1x speedup on specific highly‑irregular kernels. Source code and datasets are available at https://github.com/WildAlg/ParEVO.

Authors:Semih Cantürk, Thomas Sabourin, Frederik Wenkel, Michael Perlmutter, Guy Wolf
Title: Can Computational Reducibility Lead to Transferable Models for Graph Combinatorial Optimization?
Abstract:
A key challenge in deriving unified neural solvers for combinatorial optimization (CO) is efficient generalization of models between a given set of tasks to new tasks not used during the initial training process. To address it, we first establish a new model, which uses a GCON module as a form of expressive message passing together with energy‑based unsupervised loss functions. This model achieves high performance (often comparable with state‑of‑the‑art results) across multiple CO tasks when trained individually on each task. We then leverage knowledge from the computational reducibility literature to propose pretraining and fine‑tuning strategies that transfer effectively (a) between MVC, MIS and MaxClique, and (b) in a multi‑task learning setting that additionally incorporates MaxCut, MDS and graph coloring. Additionally, in a leave‑one‑out, multi‑task learning setting, we observe that pretraining on all but one task almost always leads to faster convergence on the remaining task when fine‑tuning while avoiding negative transfer. Our findings indicate that learning common representations across multiple graph CO problems is viable through the use of expressive message passing coupled with pretraining strategies that are informed by the polynomial reduction literature, thereby taking an important step towards enabling the development of foundational models for neural CO. We provide an open‑source implementation of our work at https://github.com/semihcanturk/COPT‑MT .

Authors:Zhanghan Ni, Yanjing Li, Zeju Qiu, Bernhard Schölkopf, Hongyu Guo, Weiyang Liu, Shengchao Liu
Title: Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
Abstract:
Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non‑rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity‑Aware Self‑Supervised Learning), a geometric pretraining framework that front‑loads geometry learning prior to generative finetuning. Phase I (RigidSSL‑Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL‑MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi‑directional, rigidity‑aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL‑Perturb improves the success rate by 5.8% in zero‑shot motif scaffolding and RigidSSL‑MD captures more biophysically realistic conformational ensembles in G protein‑coupled receptor modeling.

Authors:Marta Grzeskiewicz
Title: Neural Demand Estimation with Habit Formation and Rationality Constraints
Abstract:
We develop a flexible neural demand system for continuous budget allocation that estimates budget shares on the simplex by minimizing KL divergence. Shares are produced via a softmax of a state‑dependent preference scorer and disciplined with regularity penalties (monotonicity, Slutsky symmetry) to support coherent comparative statics and welfare without imposing a parametric utility form. State dependence enters through a habit stock defined as an exponentially weighted moving average of past consumption. Simulations recover elasticities and welfare accurately and show sizable gains when habit formation is present. In our empirical application using Dominick's analgesics data, adding habit reduces out‑of‑sample error by c.33%, reshapes substitution patterns, and increases CV losses from a 10% ibuprofen price rise by about 15‑16% relative to a static model. The code is available at https://github.com/martagrz/neural_demand_habit .

Authors:Kyle Elliott Mathewson
Title: Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry
Abstract:
Do neural machine translation models learn language‑universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta's NLLB‑200, a 200‑language encoder‑decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program (ρ= 0.13, p = 0.020), demonstrating that NLLB‑200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non‑colexified pairs (U = 42656, p = 1.33 × 10^‑11, d = 0.96), indicating that the model has internalized universal conceptual associations. Per‑language mean‑centering of embeddings improves the between‑concept to within‑concept distance ratio by a factor of 1.19, providing geometric evidence for a language‑neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross‑lingual consistency (mean cosine = 0.84), suggesting that second‑order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open‑source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.

Authors:Jiace Zhu, Wentao Chen, Qi Fan, Zhixing Ren, Junying Wu, Xing Zhe Chai, Chotiwit Rungrueangwutthinon, Yehan Ma, An Zou
Title: CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
Abstract:
Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU Kernels. Current benchmarks focus on the translation of high‑level languages into CUDA, overlooking the more general and challenging task of text‑to‑CUDA generation. Furthermore, given the hardware‑specific and performance‑critical features of GPU programming, accurately assessing the performance of LLM‑generated GPU programs is nontrivial. In this work, we introduce CUDABench, a comprehensive benchmark designed to evaluate the text‑to‑CUDA capabilities of LLMs. First, we construct CUDABench‑Set, which covers Breadth‑Depth‑Difficulty evaluation space in diverse application domains, including artificial intelligence, scientific computing, and data analytics, etc. Furthermore, we propose CUDABench‑Score and Generative Verification Pipeline that assess (1) compilation correctness, (2) functional consistency through execution‑based verification, and (3) a novel roofline‑based metric, Performance‑Score. Benchmarking state‑of‑the‑art LLMs reveals insightful findings and challenges of text‑to‑CUDA, such as a notable mismatch between high compilation success rates and low functional correctness, a lack of domain‑specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. Our benchmark is available at https://github.com/CUDA‑Bench/CUDABench.

Authors:Ran Li, Shimin Di, Haowei LI, Luanshi Bu, Jiachuan Wang, Wangze Ni, Lei Chen
Title: RxnNano:Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchical Curriculum Learning
Abstract:
Chemical reaction prediction is pivotal for accelerating drug discovery and synthesis planning. Despite advances in data‑driven models, current approaches are hindered by an overemphasis on parameter and dataset scaling. Some methods coupled with evaluation techniques that bypass fundamental challenges in reaction representation and fail to capture deep chemical intuition like reaction common sense and topological atom mapping logic. We argue that the core challenge lies in instilling these knowledge into the models. To this end, we propose a unified framework that prioritizes chemical understanding over scale through three key innovations: (1) a Latent Chemical Consistency objective that models reactions as movements on a continuous chemical manifold, ensuring reversible and physically plausible transformations; (2) a Hierarchical Cognitive Curriculum that trains the model through progressive stages, from syntax mastery to semantic reasoning, building robust chemical intuition; (3) Atom‑Map Permutation Invariance (AMPI), which force the model to learn invariant relational topology and balance multi‑task learning. (4)and structured plan‑based reasoning to improve the performance of the LLMs. Our compact 0.5B‑parameter model, RxnNano significantly outperforms fine‑tuned LLMs ten times larger (>7B) and all the domain baselines, achieving a 23.5% Top‑1 accuracy improvement on rigorous benchmarks without test‑time augmentation. https://github.com/rlisml/RxnNano.

Authors:Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang
Title: Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
Abstract:
General‑purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame‑level supervision. While effective for expert demonstrations, this paradigm scales poorly to large‑scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra‑trajectory progress supervision with inter‑trajectory preference supervision. Robometer is trained with a dual objective: a frame‑level progress loss that anchors reward magnitude on expert data, and a trajectory‑comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM‑1M, a reward‑learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real‑world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.

Authors:Harikrishnan Unnikrishnan
Title: A Detection-Gated Pipeline for Robust Glottal Area Waveform Extraction and Clinical Pathology Assessment
Abstract:
Background: Accurate glottal segmentation in high‑speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non‑glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection‑gated pipeline that integrates a localizer with a segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and occlusion. The segmenter was trained on a limited subset of the GIRAFE dataset (600 frames), while the localizer was trained on the BAGLS training set. The in‑distribution localizer provides a tight region of interest (ROI), removing geometric anatomical variations and enabling cross‑dataset generalization without fine‑tuning. Results: The pipeline achieved state‑of‑the‑art performance on the GIRAFE (DSC=0.81) and BAGLS (DSC=0.85) benchmarks and demonstrated superior generalizability. Notably, the framework maintained robust cross‑dataset generalization (DSC=0.77). Downstream validation on a 65‑subject clinical cohort confirmed that automated kinematic features ‑ specifically the Open Quotient and Glottal Area Waveform (GAW) ‑ remained consistent with clinical benchmarks. The coefficient of variation (CV) of the glottal area was a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: This architecture provides a computationally efficient solution (~35 frames/s) suitable for real‑time clinical use. By overcoming cross‑dataset variability, this framework facilitates the standardized, large‑scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at https://github.com/hari‑krishnan/openglottal.

Authors:Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Title: KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
Abstract:
Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed KDFlow, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero‑copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off‑policy and on‑policy distillation and incorporates KD algorithms for cross‑tokenizer KD through highly extensible and user‑friendly APIs. Experiments show that KDFlow can achieve 1.44× to 6.36× speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow

Authors:Naoki Shitanda, Motoki Omura, Tatsuya Harada, Takayuki Osa
Title: Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning
Abstract:
Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble‑based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter‑policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample‑efficient learning in ensemble policy gradient methods. Project page at https://naoki04.github.io/paper‑cpo/ .

Authors:Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian
Title: FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents
Abstract:
Fine‑tuning large language models for vertical domains remains labor‑intensive, requiring practitioners to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning and language agents, end‑to‑end LLM fine‑tuning has not been systematically studied as an interactive agent task. We introduce FT‑Dojo, an interactive benchmark environment for autonomous LLM fine‑tuning, comprising 13 tasks across 5 domains. Rather than a new collection of static datasets, FT‑Dojo standardizes a task interface, shared raw‑data repository, sandboxed execution environment, structured feedback protocol, and held‑out evaluation procedure. We further develop FT‑Agent, a fine‑tuning‑oriented autonomous framework that uses structured iteration planning, fail‑fast validation, and multi‑level feedback analysis to refine data and training strategies. Experiments show that FT‑Agent provides a strong initial baseline, achieving the best performance on 10 out of 13 tasks, with additional controlled comparisons against frontier agents, open‑source planning backbones, and multi‑run statistics supporting the main findings. Case studies show that agents can recover from failures through cumulative learning, while still exposing limitations in causal diagnosis and long‑horizon planning. The implementation is available at https://github.com/microsoft/rd‑agent.

Authors:Yifei Zhang, Xu Yang, Xiao Yang, Bowen Xian, Qizheng Li, Shikai Fang, Jingyuan Li, Jian Wang, Mingrui Xu, Weiqing Liu, Jiang Bian
Title: Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
Abstract:
LLM‑based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient‑free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce \textscGome, an MLE agent that operationalizes gradient‑based optimization. \textscGome maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi‑trace execution to distributed optimization. Under a closed‑world protocol that isolates architectural effects from external knowledge, \textscGome achieves a state‑of‑the‑art 35.1% any‑medal rate on MLE‑Bench with a restricted 12‑hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient‑based optimization progressively outperforms, with the gap widening at frontier‑tier models. Given the rapid advancement of reasoning‑oriented LLMs, this positions gradient‑based optimization as an increasingly favorable paradigm. We release our codebase and GPT‑5 traces at https://github.com/microsoft/RD‑Agent.

Authors:Jérome Eertmans, Enrico M. Vitucci, Vittorio Degli-Esposti, Nicola Di Cicco, Laurent Jacques, Claude Oestges
Title: Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling
Abstract:
Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the interaction order. This bottleneck limits its use in large‑scale or real‑time applications, forcing traditional tools to rely on heuristics that reduce path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a machine‑learning‑assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying these generative models to this domain presents challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high‑order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key components. First, an \emphexperience replay buffer captures and retains rare valid paths. Second, a uniform exploratory policy improves generalization and prevents overfitting to simple geometries. Third, a physics‑based action masking strategy filters out physically impossible paths before the model considers them. Validated on idealized street‑canyon scenarios, our model achieves substantial speedups over exhaustive search ‑‑ up to 10× faster on GPU and 100× faster on CPU ‑‑ while maintaining high coverage accuracy and successfully uncovering complex propagation paths. However, out‑of‑distribution evaluations on real‑world Manhattan street geometries reveal that generalizing to substantially different urban morphologies requires further advancement in model capacity or alternative training strategies. Source code, tests, and a tutorial are available at https://github.com/jeertmans/sampling‑paths.

Authors:Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu
Title: CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework
Abstract:
Large visual language models (VLMs) have shown strong multi‑modal medical reasoning ability, but most operate as end‑to‑end black boxes, diverging from clinicians' evidence‑based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi‑modal medical Reasoning with an Evidence‑grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub‑modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity‑referring segmentation model produces pixel‑level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence‑answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE‑Flow (coordinator‑free) improves average accuracy by 10.9% over the same size (10B) state‑of‑the‑art (SOTA). With dynamic planning and answer review, our CARE‑Coord yields a further gain, outperforming the heavily pre‑trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI. Project page: https://xypb.github.io/CARE‑Project‑Page/

Authors:Guang Huang, Zeyi Wen
Title: Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
Abstract:
Speculative Decoding (SD) has emerged as a premier technique for accelerating Large Language Model (LLM) inference by decoupling token generation into rapid drafting and parallel verification. While recent advancements in self‑speculation and lookahead decoding have successfully minimized drafting overhead, they have shifted the primary performance bottleneck to the verification phase. Since verification requires a full forward pass of the target model, it remains strictly memory‑bandwidth bound, fundamentally limiting the maximum achievable speedup.In this paper, we introduce Quasar (Quantized Self‑speculative Acceleration for Rapid Inference), a novel, training‑free framework designed to overcome this "memory wall" by employing low‑bit quantization specifically for the verification stage. Our empirical analysis reveals that while aggressive structural pruning significantly degrades verification accuracy, quantization‑based verification preserves the logit distribution with high fidelity while effectively halving memory traffic. Extensive experiments on state‑of‑the‑art models (e.g., OpenPangu and Qwen3) demonstrate that Quasar maintains a speculative acceptance length comparable to full‑precision methods while achieving a 1.28× improvement in end‑to‑end throughput. Being orthogonal to existing drafting strategies, Quasar offers a generic and efficient pathway to accelerate the verification leg of speculative execution. Code is available at https://github.com/Tom‑HG/Quasar.

Authors:Mehdi Makni, Xiang Meng, Rahul Mazumder
Title: 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs
Abstract:
Sparse plus Low‑Rank (\mathbfS + \mathbfLR) decomposition of Large Language Models (LLMs) has emerged as a promising direction in model compression, aiming to decompose pre‑trained model weights into a sum of sparse and low‑rank matrices (\mathbfW \approx \mathbfS + \mathbfLR). Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce 3BASiL‑TM, an efficient one‑shot post‑training method for (\mathbfS + \mathbfLR) decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3‑Block Alternating Direction Method of Multipliers (ADMM) method, termed 3BASiL, to minimize the layer‑wise reconstruction error with convergence guarantees. We then design an efficient transformer‑matching (TM) refinement step that jointly optimizes the sparse and low‑rank components across transformer layers. This step minimizes a novel memory‑efficient loss that aligns outputs at the transformer level. Notably, the TM procedure is universal as it can enhance any (\mathbfS + \mathbfLR) decomposition, including pure sparsity. Our numerical experiments show that 3BASiL‑TM reduces the WikiText2 perplexity gap relative to dense LLaMA‑8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to SOTA (\mathbfS + \mathbfLR) method. Our code is available at https://github.com/mazumder‑lab/3BASiL.

Authors:Maifang Zhang, Hang Yu, Qian Zuo, Cheng Wang, Vaishak Belle, Fengxiang He
Title: Integrating LTL Constraints into PPO for Safe Reinforcement Learning
Abstract:
This paper proposes Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO‑LTL), a framework that integrates safety constraints written in LTL into PPO for safe reinforcement learning. LTL constraints offer rigorous representations of complex safety requirements, such as regulations that broadly exist in robotics, enabling systematic monitoring of safety requirements. Violations against LTL constraints are monitored by limit‑deterministic Büchi automata, and then translated by a logic‑to‑cost mechanism into penalty signals. The signals are further employed for guiding the policy optimization via the Lagrangian scheme. Extensive experiments on the Zones and CARLA environments show that our PPO‑LTL can consistently reduce safety violations, while maintaining competitive performance, against the state‑of‑the‑art methods. The code is at https://github.com/EVIEHub/PPO‑LTL.

Authors:Masahiro Kaneko, Ayana Niwa, Timothy Baldwin
Title: JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
Abstract:
Fake news undermines societal trust and decision‑making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region‑specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models (LLMs) requires a multi‑lingual and regional perspective. Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news. However, no benchmark currently exists to systematically assess attack resilience across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak‑induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub‑metrics through LLM‑as‑a‑Judge and 5 jailbreak attacks, with approximately 300k instances. Our evaluation of 9 LLMs reveals that the maximum attack success rate (ASR) reached 86.3% and the maximum harmfulness score was 3.5 out of 5. Notably, for English and U.S.‑related topics, the defensive performance of typical multi‑lingual LLMs was significantly lower than for other regions, highlighting substantial imbalances in safety across languages and regions. In addition, our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias. Our dataset and code are available at https://github.com/kanekomasahiro/jail_news_bench.

Authors:Oscar Rivera, Ziqing Wang, Matthieu Dagommer, Abhishek Pandey, Kaize Ding
Title: GlassMol: Interpretable Molecular Property Prediction with Concept Bottleneck Models
Abstract:
Machine learning accelerates molecular property prediction, yet state‑of‑the‑art Large Language Models and Graph Neural Networks operate as black boxes. In drug discovery, where safety is critical, this opacity risks masking false correlations and excluding human expertise. Existing interpretability methods suffer from the effectiveness‑trustworthiness trade‑off: explanations may fail to reflect a model's true reasoning, degrade performance, or lack domain grounding. Concept Bottleneck Models (CBMs) offer a solution by projecting inputs to human‑interpretable concepts before readout, ensuring that explanations are inherently faithful to the decision process. However, adapting CBMs to chemistry faces three challenges: the Relevance Gap (selecting task‑relevant concepts from a large descriptor space), the Annotation Gap (obtaining concept supervision for molecular data), and the Capacity Gap (degrading performance due to bottleneck constraints). We introduce GlassMol, a model‑agnostic CBM that addresses these gaps through automated concept curation and LLM‑guided concept selection. Experiments across thirteen benchmarks demonstrate that \method generally matches or exceeds black‑box baselines, suggesting that interpretability does not sacrifice performance and challenging the commonly assumed trade‑off. Code is available at https://github.com/walleio/GlassMol.

Authors:Gaojie Jin, Xinping Yi, Wei Huang, Sven Schewe, Xiaowei Huang
Title: S2O: Enhancing Adversarial Training with Second-Order Statistics of Weights
Abstract:
Adversarial training has emerged as a highly effective way to improve the robustness of deep neural networks (DNNs). It is typically conceptualized as a min‑max optimization problem over model weights and adversarial perturbations, where the weights are optimized using gradient descent methods, such as SGD. In this paper, we propose a novel approach by treating model weights as random variables, which paves the way for enhancing adversarial training through Second‑Order Statistics Optimization (S^2O) over model weights. We challenge and relax a prevalent, yet often unrealistic, assumption in prior PAC‑Bayesian frameworks: the statistical independence of weights. From this relaxation, we derive an improved PAC‑Bayesian robust generalization bound. Our theoretical developments suggest that optimizing the second‑order statistics of weights can substantially tighten this bound. We complement this theoretical insight by conducting an extensive set of experiments that demonstrate that S^2O not only enhances the robustness and generalization of neural networks when used in isolation, but also seamlessly augments other state‑of‑the‑art adversarial training techniques. The code is available at https://github.com/Alexkael/S2O.

Authors:Changwoo Baek, Jouwon Song, Sohyeon Kim, Kyeongbo Kong
Title: AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
Abstract:
Large Vision‑Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention‑based or diversity‑based pruning methods, in‑depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank‑based quantitative analysis shows that many diversity‑oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention‑based pruning. (2) We further observe that attention‑based approaches are more effective on simple images where visual evidence is concentrated, while diversity‑based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image‑aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination‑specific evaluations. Our project page available at https://cvsp‑lab.github.io/AgilePruner.

Authors:Victor May, Aaditya Salgarkar, Yishan Wang, Diganta Misra, Huu Nguyen
Title: Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics
Abstract:
Tool‑augmented LLMs are increasingly deployed as agents that interleave natural‑language reasoning with executable Python actions, as in CodeAct‑style frameworks. In deployment, these agents rely on runtime state that persists across steps. By contrast, the traces used to post‑train these models rarely encode how interpreter state is managed. We ask whether interpreter persistence is merely a runtime scaffold, or a property of the training data that shapes how agents learn to use the interpreter. We isolate state persistence as a training‑time variable. We introduce Opaque Knapsack, a procedurally generated family of partially observable optimization tasks designed to prevent one‑shot solutions. Item attributes and constraints are hidden behind budgeted tool calls, forcing multi‑turn control flow and iterative state revision. Holding task instances, prompts, tools, model, and supervision fixed, we generate matched trajectories differing only in whether interpreter state persists across steps or resets after each action. We then fine‑tune identical base models (Qwen3‑8B) on each trace variant and evaluate all four train‑runtime combinations. Our 2x2 cross‑evaluation shows that interpreter persistence shapes how agents reach solutions, not whether they do: solution quality is statistically indistinguishable across conditions, but token cost and stability differ substantially. A persistent‑trained model in a stateless runtime triggers missing‑variable errors in roughly 80% of episodes; a stateless‑trained model in a persistent runtime redundantly re‑derives retained state, using roughly 3.5x more tokens. Interpreter persistence should be treated as a first‑class semantic of agent traces. Aligning fine‑tuning data with deployment runtimes improves efficiency and reduces brittle train‑runtime mismatches.

Authors:Hrishikesh Viswanath, Hong Chul Nam, Xi Deng, Julius Berner, Anima Anandkumar, Aniket Bera
Title: Operator Learning Using Weak Supervision from Walk-on-Spheres
Abstract:
Training neural PDE solvers is often bottlenecked by expensive data generation or unstable physics‑informed neural network (PINN) involving challenging optimization landscapes due to higher‑order derivatives. To tackle this issue, we propose an alternative approach using Monte Carlo approaches to estimate the solution to the PDE as a stochastic process for weak supervision during training. Leveraging the Walk‑on‑Spheres method, we introduce a learning scheme called \emphWalk‑on‑Spheres Neural Operator (WoS‑NO) which uses weak supervision from WoS to train any given neural operator. We propose to amortize the cost of Monte Carlo walks across the distribution of PDE instances using stochastic representations from the WoS algorithm to generate cheap, noisy, estimates of the PDE solution during training. This is formulated into a data‑free physics‑informed objective where a neural operator is trained to regress against these weak supervisions, allowing the operator to learn a generalized solution map for an entire family of PDEs. This strategy does not require expensive pre‑computed datasets, avoids computing higher‑order derivatives for loss functions that are memory‑intensive and unstable, and demonstrates zero‑shot generalization to novel PDE parameters and domains. Experiments show that for the same number of training steps, our method exhibits up to 8.75× improvement in L_2‑error compared to standard physics‑informed training schemes, up to 6.31× improvement in training speed, and reductions of up to 2.97× in GPU memory consumption. We present the code at https://github.com/neuraloperator/WoS‑NO

Authors:Sumin Kim, Hyemin Jeong, Mingu Kang, Yejin Kim, Yoori Oh, Joonseok Lee
Title: TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
Abstract:
The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality‑agnostic fusion strategies. These methods fail to account for the dynamic, frame‑dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large‑scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state‑of‑the‑art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.

Authors:Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo, Guilin Qi, Shirui Pan, Gholamreza Haffari
Title: CARD: Towards Conditional Design of Multi-agent Topological Structures
Abstract:
Large language model (LLM)‑based multi‑agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real‑world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph‑generation framework that instantiates AMACP, a protocol for adaptive multi‑agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment‑aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt‑based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: https://github.com/Warma10032/CARD.

Authors:Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen
Title: LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
Abstract:
We present LLaDA‑o, an effective and length‑adaptive omni diffusion model for multimodal understanding and generation. LLaDA‑o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data‑centric length adaptation strategy that enables flexible‑length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA‑o achieves state‑of‑the‑art performance among omni‑diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG‑Bench for text‑to‑image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML‑GSAI/LLaDA‑o.

Authors:Jiafeng Lin, Yuxuan Wang, Jialong Wu, Huakun Luo, Zhongyi Pei, Jianmin Wang
Title: Thoth: Mid-Training Bridges LLMs to Time Series Understanding
Abstract:
Large Language Models (LLMs) have demonstrated remarkable success in general‑purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision‑making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid‑trained LLMs with general‑purpose time series understanding capabilities. As a pivotal intermediate stage, mid‑training achieves task‑ and domain‑agnostic alignment between time series and natural language, for which we construct Book‑of‑Thoth, a high‑quality, time‑series‑centric mid‑training corpus. Book‑of‑Thoth enables both time‑series‑to‑text and text‑to‑time‑series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge‑intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid‑training with Book‑of‑Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine‑tuned under data scarcity, underscoring the effectiveness of mid‑training for time series understanding. Code is available at: https://github.com/thuml/Thoth.

Authors:Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang
Title: Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
Abstract:
Tensor Ring (TR) decomposition is a powerful tool for high‑order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non‑meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine‑scale details is intrinsically difficult. Through a frequency‑domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high‑frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super‑resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at https://github.com/YangyangXu2002/RepTRFD.

Authors:Tony Hauptmann, Stefan Kramer
Title: Feature-Weighted Maximum Representative Subsampling
Abstract:
In the social sciences, it is often necessary to debias studies and surveys before valid conclusions can be drawn. Debiasing algorithms enable the computational removal of bias using sample weights. However, an issue arises when only a subset of features is highly biased, while the rest is already representative. Algorithms need to strongly alter the sample distribution to manage a few highly biased features, which can in turn introduce bias into already representative variables. To address this issue, we developed a method that uses feature weights to minimize the impact of highly biased features on the computation of sample weights. Our algorithm is based on Maximum Representative Subsampling (MRS), which debiases datasets by aligning a non‑representative sample with a representative one through iterative removal of elements to create a representative subsample. The new algorithm, named feature‑weighted MRS (FW‑MRS), decreases the emphasis on highly biased features, allowing it to retain more instances for downstream tasks. The feature weights are derived from the feature importance of a domain classifier trained to differentiate between the representative and non‑representative datasets. We validated FW‑MRS using eight tabular datasets, each of which we artificially biased. Biased features can be important for downstream tasks, and focusing less on them could lead to a decline in generalization. For this reason, we assessed the generalization performance of FW‑MRS on downstream tasks and found no statistically significant differences. Additionally, FW‑MRS was applied to a real‑world dataset from the social sciences. The source code is available at https://github.com/kramerlab/FeatureWeightDebiasing.

Authors:Ke Sun, Hongming Zhang, Jun Jin, Chao Gao, Xi Chen, Wulong Liu, Linglong Kong
Title: Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning
Abstract:
Inspired by the human learning and memory system, particularly the interplay between the hippocampus and cerebral cortex, this study proposes a dual‑learner framework comprising a fast learner and a meta learner to address continual Reinforcement Learning~(RL) problems. These two learners are coupled to perform distinct yet complementary roles: the fast learner focuses on knowledge transfer, while the meta learner ensures knowledge integration. In contrast to traditional multi‑task RL approaches that share knowledge through average return maximization, our meta learner incrementally integrates new experiences by explicitly minimizing catastrophic forgetting, thereby supporting efficient cumulative knowledge transfer for the fast learner. To facilitate rapid adaptation in new environments, we introduce an adaptive meta warm‑up mechanism that selectively harnesses past knowledge. We conduct experiments in various pixel‑based and continuous control benchmarks, revealing the superior performance of continual learning for our proposed dual‑learner approach relative to baseline methods. The code is released in https://github.com/datake/FAME.

Authors:Igor Berezkin
Title: Wave-Attractor-Tree: A Hierarchical Binary Tree Reduction Architecture for Efficient Sequence Modeling
Abstract:
Work introduces a hierarchical binary tree‑based reduction that replaces standard self‑attention. The core idea is to use a recursive Gated Linear Unit merge operation, achieving O(n) total merge operations O(log n) parallel depth O(n d^2) total work and O(n) space complexity. In these experiments, the model significantly outperforms standard Transformers in both convergence speed and accuracy on long‑range structural dependencies, specifically where hierarchical inductive bias is critical.

Authors:Shilong Tao, Zhe Feng, Shaohan Chen, Weichen Zhang, Zhanxing Zhu, Yunhuai Liu
Title: Neural Latent Arbitrary Lagrangian-Eulerian Grids for Fluid-Solid Interaction
Abstract:
Fluid‑solid interaction (FSI) problems are fundamental in many scientific and engineering applications, yet effectively capturing the highly nonlinear two‑way interactions remains a significant challenge. Most existing deep learning methods are limited to simplified one‑way FSI scenarios, often assuming rigid and static solid to reduce complexity. Even in two‑way setups, prevailing approaches struggle to capture dynamic, heterogeneous interactions due to the lack of cross‑domain awareness. In this paper, we introduce Fisale, a data‑driven framework for handling complex two‑way FSI problems. It is inspired by classical numerical methods, namely the Arbitrary Lagrangian‑Eulerian (ALE) method and the partitioned coupling algorithm. Fisale explicitly models the coupling interface as a distinct component and leverages multiscale latent ALE grids to provide unified, geometry‑aware embeddings across domains. A partitioned coupling module (PCM) further decomposes the problem into structured substeps, enabling progressive modeling of nonlinear interdependencies. Compared to existing models, Fisale introduces a more flexible framework that iteratively handles complex dynamics of solid, fluid and their coupling interface on a unified representation, and enables scalable learning of complex two‑way FSI behaviors. Experimentally, Fisale excels in three reality‑related challenging FSI scenarios, covering 2D, 3D and various tasks. The code is available at \hrefhttps://github.com/therontau0054/Fisale.

Authors:Minkyoung Cho, Insu Jang, Shuowei Jin, Zesen Zhao, Adityan Jothi, Ethem F. Can, Min-Hung Chen, Z. Morley Mao
Title: MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search
Abstract:
Fine‑tuning Multimodal Large Language Models (MLLMs) with parameter‑efficient methods like Low‑Rank Adaptation (LoRA) is crucial for task adaptation. However, imbalanced training dynamics across modalities often lead to suboptimal accuracy due to negative interference, a challenge typically addressed with inefficient heuristic methods such as manually tuning separate learning rates. To overcome this, we introduce MARS (Multimodal Adaptive Rank Search), an approach to discover optimal rank pairs that balance training dynamics while maximizing performance. Our key innovation, a proposed framework of dual scaling laws, enables this search: one law models module‑specific convergence time to prune the search space to candidates with aligned dynamics, while the other predicts final task performance to select the optimal pair from the pruned set. By re‑purposing the LoRA rank as a controller for modality‑specific convergence speed, MARS outperforms baseline methods and provides a robust, automated strategy for optimizing MLLM fine‑tuning.

Authors:Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Zhiqiang Shen
Title: Exploring 3D Dataset Pruning
Abstract:
Dataset pruning has been widely studied for 2D images to remove redundancy and accelerate training, while particular pruning methods for 3D data remain largely unexplored. In this work, we study dataset pruning for 3D data, where its observed common long‑tail class distribution nature make optimization under conventional evaluation metrics Overall Accuracy (OA) and Mean Accuracy (mAcc) inherently conflicting, and further make pruning particularly challenging. To address this, we formulate pruning as approximating the full‑data expected risk with a weighted subset, which reveals two key errors: coverage error from insufficient representativeness and prior‑mismatch bias from inconsistency between subset‑induced class weights and target metrics. We propose representation‑aware subset selection with per‑class retention quotas for long‑tail coverage, and prior‑invariant teacher supervision using calibrated soft labels and embedding‑geometry distillation. The retention quota also serves as a switch to control the OA‑mAcc trade‑off. Extensive experiments on 3D datasets show that our method can improve both metrics across multiple settings while adapting to different downstream preferences. Our code is available at https://github.com/XiaohanZhao123/3D‑Dataset‑Pruning.

Authors:Jin Zeng, Yupeng Qi, Hui Li, Chengming Li, Ziyu Lyu, Lixin Cui, Lu Bai
Title: RAIE: Region-Aware Incremental Preference Editing with LoRA for LLM-based Recommendation
Abstract:
Large language models (LLMs) are increasingly adopted as the backbone of recommender systems. However, user‑item interactions in real‑world scenarios are non‑stationary, making preference drift over time inevitable. Existing model update strategies mainly rely on global fine‑tuning or pointwise editing, but they face two fundamental challenges: (i) imbalanced update granularity, where global updates perturb behaviors unrelated to the target while pointwise edits fail to capture broader preference shifts; (ii) unstable incremental updates, where repeated edits interfere with prior adaptations, leading to catastrophic forgetting and inconsistent recommendations. To address these issues, we propose Region‑Aware Incremental Editing (RAIE), a plug‑in framework that freezes the backbone model and performs region‑level updates. RAIE first constructs semantically coherent preference regions via spherical k‑means in the representation space. It then assigns incoming sequences to regions via confidence‑aware gating and performs three localized edit operations ‑ Update, Expand, and Add ‑ to dynamically revise the affected region. Each region is equipped with a dedicated Low‑Rank Adaptation (LoRA) module, which is trained only on the region's updated data. During inference, RAIE routes each user sequence to its corresponding region and activates the region‑specific adapter for prediction. Experiments on two benchmark datasets under a time‑sliced protocol that segments data into Set‑up (S), Finetune (F), and Test (T) show that RAIE significantly outperforms state‑of‑the‑art baselines while effectively mitigating forgetting. These results demonstrate that region‑aware editing offers an accurate and scalable mechanism for continual adaptation in dynamic recommendation scenarios. Our code is available at https://github.com/fengaogao/RAIE.

Authors:Cedric Damour
Title: Retrodictive Forecasting: A Proof-of-Concept for Exploiting Temporal Asymmetry in Time Series Prediction
Abstract:
We propose a retrodictive forecasting paradigm for time series: instead of predicting the future from the past, we identify the future that best explains the observed present via inverse MAP optimization over a Conditional Variational Autoencoder (CVAE). This conditioning is a statistical modeling choice for Bayesian inversion; it does not assert that future events cause past observations. The approach is theoretically grounded in an information‑theoretic arrow‑of‑time measure: the symmetrized Kullback‑Leibler divergence between forward and time‑reversed trajectory ensembles provides both the conceptual rationale and an operational GO/NO‑GO diagnostic for applicability. We implement the paradigm as MAP inference over an inverse CVAE with a learned RealNVP normalizing‑flow prior and evaluate it on six time series cases: four synthetic processes with controlled temporal asymmetry and two ERA5 reanalysis datasets (wind speed and solar irradiance). The work makes four contributions: (i) a formal retrodictive inference formulation; (ii) an inverse CVAE architecture; (iii) a model‑free irreversibility diagnostic; and (iv) a falsifiable validation protocol with four pre‑specified predictions. All pre‑specified predictions are empirically supported: the diagnostic correctly classifies all six cases; the learned flow prior improves over an isotropic Gaussian baseline on GO cases; the inverse MAP yields no spurious advantage on time‑reversible dynamics; and on irreversible GO cases, it achieves competitive or superior RMSE relative to forward baselines, with a statistically significant 17.7% reduction over a forward MLP on ERA5 solar irradiance. These results provide a structured proof‑of‑concept that retrodictive forecasting can constitute a viable alternative to conventional forward prediction when statistical time‑irreversibility is present and exploitable.

Authors:Yunzhong Qiu, Zhiyao Cen, Zhongyi Pei, Chen Wang, Jianmin Wang
Title: Adapt Data to Model: Adaptive Transformation Optimization for Domain-shared Time Series Foundation Models
Abstract:
Large time series models (LTMs) have emerged as powerful tools for universal forecasting, yet they often struggle with the inherent diversity and nonstationarity of real‑world time series data, leading to an unsatisfactory trade‑off between forecasting accuracy and generalization. Rather than continually finetuning new LTM instances for each domain, we propose a data‑centric framework, time‑series adaptive transformation optimization (TATO), that enables a single frozen pre‑trained LTM to adapt to diverse downstream domains through an optimally configured transformation pipeline. Specifically, TATO constructs three representative types of transformations, including context slicing, scale normalization, and outlier correction, to help LTMs better align with target domain characteristics. To ensure robustness, we incorporate carefully selected time series augmentations and a two‑stage ranking mechanism that filters out pipelines underperforming on specific metrics. Extensive experiments on state‑of‑the‑art LTMs and widely used datasets demonstrate that TATO consistently and significantly improves domain‑adaptive forecasting performance, achieving a maximum reduction in MSE of 65.4% and an average reduction of 13.6%. Moreover, TATO is highly efficient, typically completing optimization in under 2 minutes, making it practical for real‑world deployment. The source code is available at https://github.com/thulab/TATO.

Authors:Zhanwang Liu, Yuting Li, Haoyuan Gao, Yexin Li, Linghe Kong, Lichao Sun, Weiran Huang
Title: IDER: IDempotent Experience Replay for Reliable Continual Learning
Abstract:
Catastrophic forgetting, the tendency of neural networks to forget previously learned knowledge when learning new tasks, has been a major challenge in continual learning (CL). To tackle this challenge, CL methods have been proposed and shown to reduce forgetting. Furthermore, CL models deployed in mission‑critical settings can benefit from uncertainty awareness by calibrating their predictions to reliably assess their confidences. However, existing uncertainty‑aware continual learning methods suffer from high computational overhead and incompatibility with mainstream replay methods. To address this, we propose idempotent experience replay (IDER), a novel approach based on the idempotent property where repeated function applications yield the same output. Specifically, we first adapt the training loss to make model idempotent on current data streams. In addition, we introduce an idempotence distillation loss. We feed the output of the current model back into the old checkpoint and then minimize the distance between this reprocessed output and the original output of the current model. This yields a simple and effective new baseline for building reliable continual learners, which can be seamlessly integrated with other CL approaches. Extensive experiments on different CL benchmarks demonstrate that IDER consistently improves prediction reliability while simultaneously boosting accuracy and reducing forgetting. Our results suggest the potential of idempotence as a promising principle for deploying efficient and trustworthy continual learning systems in real‑world applications.Our code is available at https://github.com/YutingLi0606/Idempotent‑Continual‑Learning.

Authors:Li Sun, Zhenhao Huang, Silei Chen, Lanxu Yang, Junda Ye, Sen Su, Philip S. Yu
Title: Multi-Domain Riemannian Graph Gluing for Building Graph Foundation Models
Abstract:
Multi‑domain graph pre‑training integrates knowledge from diverse domains to enhance performance in the target domains, which is crucial for building graph foundation models. Despite initial success, existing solutions often fall short of answering a fundamental question: how is knowledge integrated or transferred across domains? This theoretical limitation motivates us to rethink the consistency and transferability between model pre‑training and domain adaptation. In this paper, we propose a fresh Riemannian geometry perspective, whose core idea is to merge any graph dataset into a unified, smooth Riemannian manifold, enabling a systematic understanding of knowledge integration and transfer. To achieve this, our key contribution is the theoretical establishment of neural manifold gluing, which first characterizes local geometry using an adaptive orthogonal frame and then "glues" the local pieces together into a coherent whole. Building on this theory, we present the GraphGlue framework, which supports batched pre‑training with EMA prototyping and provides a transferability measure based on geometric consistence. Extensive experiments demonstrate its superior performance across diverse graph domains. Moreover, we empirically validated GraphGlue's geometric scaling law, showing that larger quantities of datasets improve model transferability by producing a smoother manifold. Codes are available at https://github.com/RiemannGraph/GraphGlue.

Authors:Yuchen Hou, Lin Zhao
Title: LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
Abstract:
Vision‑Language‑Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state‑of‑the‑art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four‑dimensional semantic perturbation method ‑‑ varying instruction semantics while keeping the tabletop layout fixed ‑‑ revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick‑and‑place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap ‑‑ success rate improves from 0% to 90% with single‑task training, and 0% to 28% with multi‑task training. However, as semantic diversity of extended tasks increases, model learning capacity proves severely insufficient; even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions ‑‑ precisely the long‑term value of LangGap.

Authors:Xu Luo, Ji Zhang, Lianli Gao, Heng Tao Shen, Jingkuan Song
Title: Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols
Abstract:
Few‑shot transfer has been revolutionized by stronger pre‑trained models and improved adaptation algorithms.However, there lacks a unified, rigorous evaluation protocol that is both challenging and realistic for real‑world usage. In this work, we establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and propose the Hyperparameter Ensemble (HPE) protocol to overcome the "validation set illusion" in data‑scarce regimes. Our empirical findings demonstrate that the choice of pre‑trained model is the dominant factor for performance, while many sophisticated transfer methods offer negligible practical advantages over a simple full‑parameter fine‑tuning baseline. To explain this surprising effectiveness, we provide an in‑depth mechanistic analysis showing that full fine‑tuning succeeds via distributed micro‑adjustments and more flexible reshaping of high‑level semantic presentations without suffering from overfitting. Additionally, we quantify the performance collapse of multimodal models in specialized domains as a result of linguistic rarity using adjusted Zipf frequency scores. By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few‑shot transfer learning research. We make the FEWTRANS benchmark publicly available at https://github.com/Frankluox/FewTrans.

Authors:Sevda Öğüt, Cédric Vincent-Cuaz, Natalia Dubljevic, Carlos Hurtado, Vaishnavi Subramanian, Pascal Frossard, Dorina Thanou
Title: GrapHist: Graph Self-Supervised Learning for Histopathology
Abstract:
Self‑supervised vision models have achieved notable success in digital pathology. However, their domain‑agnostic transformer architectures are not originally designed to account for fundamental biological elements of histopathology images, namely cells and their complex interactions. In this work, we hypothesize that a biologically‑informed modeling of tissues as cell graphs offers a more efficient representation learning. Thus, we introduce GrapHist, a novel graph‑based self‑supervised learning framework for histopathology, which learns generalizable and structurally‑informed embeddings that enable diverse downstream tasks. GrapHist integrates masked autoencoders and heterophilic graph neural networks that are explicitly designed to capture the heterogeneity of tumor microenvironments. We pre‑train GrapHist on a large collection of 11 million cell graphs derived from breast tissues and evaluate its transferability across in‑ and out‑of‑domain benchmarks. Our results show that GrapHist achieves competitive performance compared to its vision‑based counterparts in slide‑, region‑, and cell‑level tasks, while requiring four times fewer parameters. It also drastically outperforms fully‑supervised graph models on cancer subtyping tasks. Finally, we also release five graph‑based digital pathology datasets used in our study at https://huggingface.co/ogutsevda/datasets , establishing the first large‑scale graph benchmark in this field. Our code is available at https://github.com/ogutsevda/graphist .

Authors:Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal
Title: Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion
Abstract:
Text‑to‑image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability‑Aware Diffusion Steering (RADS), an inference‑time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"‑‑the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state‑of‑the‑art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug‑and‑play solution for safe generation. Our website is available at: https://s‑karnik.github.io/rads‑memorization‑project‑page/.

Authors:Atah Nuh Mih, Jianzhou Wang, Truong Thanh Hung Nguyen, Hung Cao
Title: SEval-NAS: A Search-Agnostic Evaluation for Neural Architecture Search
Abstract:
Neural architecture search (NAS) automates the discovery of neural networks that meet specified criteria, yet its evaluation procedures are often hardcoded, limiting the ability to introduce new metrics. This issue is especially pronounced in hardware‑aware NAS, where objectives depend on target devices such as edge hardware. To address this limitation, we propose SEval‑NAS, a metric‑evaluation mechanism that converts architectures to strings, embeds them as vectors, and predicts performance metrics. Using NATS‑Bench and HW‑NAS‑Bench, we evaluated accuracy, latency, and memory. Kendall's τ correlations showed stronger latency and memory predictions than accuracy, indicating the suitability of SEval‑NAS as a hardware cost predictor. We further integrated SEval‑NAS into FreeREA to evaluate metrics not originally included. The method successfully ranked FreeREA‑generated architectures, maintained search time, and required minimal algorithmic changes. Our implementation is available at: https://github.com/Analytics‑Everywhere‑Lab/neural‑architecture‑search

Authors:David Jackson, Michael Gertz, Jürgen Hesser
Title: Exploring Drug Safety Through Knowledge Graphs: Protein Kinase Inhibitors as a Case Study
Abstract:
Adverse Drug Reactions (ADRs) are a leading cause of morbidity and mortality. Existing prediction methods rely mainly on chemical similarity, machine learning on structured databases, or isolated target profiles, but often fail to integrate heterogeneous, partly unstructured evidence effectively. We present a knowledge graph‑based framework that unifies diverse sources, drug‑target data (ChEMBL), clinical trial literature (PubMed), trial metadata (ClinicalTrials.gov), and post‑marketing safety reports (FAERS) into a single evidence‑weighted bipartite network of drugs and medical conditions. Applied to 400 protein kinase inhibitors, the resulting network enables contextual comparison of efficacy (HR, PFS, OS), phenotypic and target similarity, and ADR prediction via target‑to‑adverse‑event correlations. A non‑small cell lung cancer case study correctly highlights established and candidate drugs, target communities (ERbB, ALK, VEGF), and tolerability differences. Designed as an orthogonal, extensible analysis and search tool rather than a replacement for current models, the framework excels at revealing complex patterns, supporting hypothesis generation, and enhancing pharmacovigilance. Code and data are publicly available at https://github.com/davidjackson99/PKI_KG.

Authors:Chao Huang, Yanhui Li, Yunkang Cao, Wei Wang, Hongxi Huang, Jie Wen, Wenqi Ren, Xiaochun Cao
Title: M3-AD: Reflection-aware Multi-modal, Multi-category, and Multi-dimensional Benchmark and Framework for Industrial Anomaly Detection
Abstract:
Although multimodal large language models (MLLMs) have advanced industrial anomaly detection toward a zero‑shot paradigm, they still tend to produce high‑confidence yet unreliable decisions in fine‑grained and structurally complex industrial scenarios, and lack effective self‑corrective mechanisms. To address this issue, we propose M3‑AD, a unified reflection‑aware multimodal framework for industrial anomaly detection. M3‑AD comprises two complementary data resources: M3‑AD‑FT, designed for reflection‑aligned fine‑tuning, and M3‑AD‑Bench, designed for systematic cross‑category evaluation, together providing a foundation for reflection‑aware learning and reliability assessment. Building upon this foundation, we propose RA‑Monitor, which models reflection as a learnable decision revision process and guides models to perform controlled self‑correction when initial judgments are unreliable, thereby improving decision robustness. Extensive experiments conducted on M3‑AD‑Bench demonstrate that RA‑Monitor outperforms multiple open‑source and commercial MLLMs in zero‑shot anomaly detection and anomaly analysis tasks. Code will be released at https://github.com/Yanhui‑Lee/M3‑AD.

Authors:Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, Anji Liu
Title: Breaking the Factorization Barrier in Diffusion Language Models
Abstract:
Diffusion language models theoretically allow for efficient parallel generation but are practically hindered by the "factorization barrier": the assumption that simultaneously predicted tokens are independent. This limitation forces a trade‑off: models must either sacrifice speed by resolving dependencies sequentially or suffer from incoherence due to factorization. We argue that this barrier arises not from limited backbone expressivity, but from a structural misspecification: models are restricted to fully factorized outputs because explicitly parameterizing a joint distribution would require the Transformer to output a prohibitively large number of parameters. We propose Coupled Discrete Diffusion (CoDD), a hybrid framework that breaks this barrier by replacing the fully‑factorized output distribution with a lightweight, tractable probabilistic inference layer. This formulation yields a distribution family that is significantly more expressive than standard factorized priors, enabling the modeling of complex joint dependencies, yet remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling. Empirically, CoDD seamlessly enhances diverse diffusion language model architectures with negligible overhead, matching the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. Furthermore, it prevents performance collapse in few‑step generation, enabling high‑quality outputs at significantly reduced latencies. Code available at: https://github.com/liuanji/CoDD

Authors:Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala
Title: CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
Abstract:
LLM‑as‑a‑judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders ‑‑ such as verbosity, stylistic preferences, or training artifacts ‑‑ causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder‑aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true‑quality signal and shared confounding factors. Rather than heuristically re‑weighting judges, CARE separates quality from confounders without access to ground‑truth labels. We provide theoretical guarantees for identifiability and finite‑sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%. Code is released in \hrefhttps://github.com/SprocketLab/CAREhttps://github.com/SprocketLab/CARE.

Authors:Jintao Zhang, Zirui Liu, Mingyue Cheng, Xianquan Wang, Zhiding Liu, Qi Liu
Title: StaTS: Spectral Trajectory Schedule Learning for Adaptive Time Series Forecasting with Frequency Guided Denoiser
Abstract:
Diffusion models have been used for probabilistic time series forecasting and show strong potential. However, fixed noise schedules often produce intermediate states that are hard to invert and a terminal state that deviates from the near noise assumption. Meanwhile, prior methods rely on time domain conditioning and seldom model schedule induced spectral degradation, which limits structure recovery across noise levels. We propose StaTS, a diffusion model for probabilistic time series forecasting that learns the noise schedule and the denoiser through alternating updates. StaTS includes Spectral Trajectory Scheduler (STS) that learns a data adaptive noise schedule with spectral regularization to improve structural preservation and stepwise invertibility, and Frequency Guided Denoiser (FGD) that estimates schedule induced spectral distortion and uses it to modulate denoising strength for heterogeneous restoration across diffusion steps and variables. A two stage training procedure stabilizes the coupling between schedule learning and denoiser optimization. Experiments on multiple real world benchmarks show consistent gains, while maintaining strong performance with fewer sampling steps. Our code is available at https://github.com/zjt‑gpu/StaTS/.

Authors:Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat
Title: Mode Seeking meets Mean Seeking for Fast Long Video Generation
Abstract:
Scaling video generation from seconds to minutes faces a critical bottleneck: while short‑video data is abundant and high‑fidelity, coherent long‑form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long‑term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short‑video teacher via a mode‑seeking reverse‑KL divergence. This strategy enables the synthesis of minute‑scale videos that learns long‑range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding‑window segment of the student to a frozen short‑video teacher, resulting in a few‑step fast long video generator. Evaluations show that our method effectively closes the fidelity‑horizon gap by jointly improving local sharpness, motion and long‑range consistency. Project website: https://primecai.github.io/mmm/.

Authors:Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Title: Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Abstract:
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first‑ and second‑order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA‑Pre, a novel low‑rank optimizer designed for efficient pre‑training. Specifically, LoRA‑Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low‑rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA‑Pre's efficacy by pre‑training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA‑Pre achieves the highest performance across all model sizes. Notably, LoRA‑Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre‑training, we evaluate LoRA‑Pre's effectiveness in fine‑tuning scenarios. With the same rank, LoRA‑Pre consistently outperforms all efficient fine‑tuning baselines. Specifically, compared to standard LoRA, LoRA‑Pre achieves substantial improvements of 3.14 points on Llama‑3.1‑8B and 6.17 points on Llama‑2‑7B, validating our approach's effectiveness across both pre‑training and fine‑tuning paradigms. Our code is publicly available at https://github.com/mrflogs/LoRA‑Pre.

Authors:Arnas Uselis, Andrea Dittadi, Seong Joon Oh
Title: Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models
Abstract:
Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per‑concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low‑rank, near‑orthogonal per‑concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary‑compositionality.

Authors:Eugène Berta, Sacha Braun, David Holzmüller, Francis Bach, Michael I. Jordan
Title: A Variational Estimator for $L_p$ Calibration Errors
Abstract:
Calibration\unicodex2014the problem of ensuring that predicted probabilities align with observed class frequencies\unicodex2014is a basic desideratum for reliable prediction with machine learning systems. Calibration error is traditionally assessed via a divergence function, using the expected divergence between predictions and empirical frequencies. Accurately estimating this quantity is challenging, especially in the multiclass setting. Here, we show how to extend a recent variational framework for estimating calibration errors beyond divergences induced induced by proper losses, to cover a broad class of calibration errors induced by L_p divergences. Our method can separate over‑ and under‑confidence and, unlike non‑variational approaches, avoids overestimation. We provide extensive experiments and integrate our code in the open‑source package probmetrics (https://github.com/dholzmueller/probmetrics) for evaluating calibration errors.

Authors:Miras Seilkhan, Adilbek Taizhanov
Title: Comparing Classical and Quantum Variational Classifiers on the XOR Problem
Abstract:
Quantum machine learning applies principles such as superposition and entanglement to data processing and optimization. Variational quantum models operate on qubits in high‑dimensional Hilbert spaces and provide an alternative approach to model expressivity. We compare classical models and a variational quantum classifier on the XOR problem. Logistic regression, a one‑hidden‑layer multilayer perceptron, and a two‑qubit variational quantum classifier with circuit depths 1 and 2 are evaluated on synthetic XOR datasets with varying Gaussian noise and sample sizes using accuracy and binary cross‑entropy. Performance is determined primarily by model expressivity. Logistic regression and the depth‑1 quantum circuit fail to represent XOR reliably, whereas the multilayer perceptron and the depth‑2 quantum circuit achieve perfect test accuracy under representative conditions. Robustness analyses across noise levels, dataset sizes, and random seeds confirm that circuit depth is decisive for quantum performance on this task. Despite matching accuracy, the multilayer perceptron achieves lower binary cross‑entropy and substantially shorter training time. Hardware execution preserves the global XOR structure but introduces structured deviations in the decision function. Overall, deeper variational quantum classifiers can match classical neural networks in accuracy on low‑dimensional XOR benchmarks, but no clear empirical advantage in robustness or efficiency is observed in the examined settings.

Authors:Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie, Ido Hakimi, Barna Pásztor, Andreas Krause
Title: RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Abstract:
Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via uncertainty‑guided active learning and mitigate reward overoptimization in LLM post‑training. However, uncertainty‑aware reward models have so far been adopted without thorough comparison, leaving them poorly understood. This work introduces a unified framework, RewardUQ, to systematically evaluate uncertainty quantification for reward models. We compare common methods along standard metrics measuring accuracy and calibration, and we propose a new ranking strategy incorporating both dimensions for a simplified comparison. Our experimental results suggest that model size and initialization have the most meaningful impact on performance, and most prior work could have benefited from alternative design choices. To foster the development and evaluation of new methods and aid the deployment in downstream applications, we release our open‑source framework as a Python package. Our code is available at https://github.com/lasgroup/rewarduq.

Authors:Xianglong Shi, Ziheng Chen, Yunhan Jiang, Nicu Sebe
Title: Intrinsic Lorentz Neural Network
Abstract:
Real‑world data frequently exhibit latent hierarchical structures, which can be naturally represented by hyperbolic geometry. Although recent hyperbolic neural networks have demonstrated promising results, many existing architectures remain partially intrinsic, mixing Euclidean operations with hyperbolic ones or relying on extrinsic parameterizations. To address it, we propose the \emphIntrinsic Lorentz Neural Network (ILNN), a fully intrinsic hyperbolic architecture that conducts all computations within the Lorentz model. At its core, the network introduces a novel \emphpoint‑to‑hyperplane fully connected layer (FC), replacing traditional Euclidean affine logits with closed‑form hyperbolic distances from features to learned Lorentz hyperplanes, thereby ensuring that the resulting geometric decision functions respect the inherent curvature. Around this fundamental layer, we design intrinsic modules: GyroLBN, a Lorentz batch normalization that couples gyro‑centering with gyro‑scaling, consistently outperforming both LBN and GyroBN while reducing training time. We additionally proposed a gyro‑additive bias for the FC output, a Lorentz patch‑concatenation operator that aligns the expected log‑radius across feature blocks via a digamma‑based scale, and a Lorentz dropout layer. Extensive experiments conducted on CIFAR‑10/100 and two genomic benchmarks (TEB and GUE) illustrate that ILNN achieves state‑of‑the‑art performance and computational cost among hyperbolic models and consistently surpasses strong Euclidean baselines. The code is available at \hrefhttps://github.com/Longchentong/ILNN\textcolormagentathis url.

Authors:Andrei-Alexandru Bunea, Dan-Matei Popovici, Radu Tudor Ionescu
Title: SegMate: Asymmetric Attention-Based Lightweight Architecture for Efficient Multi-Organ Segmentation
Abstract:
State‑of‑the‑art models for medical image segmentation achieve excellent accuracy but require substantial computational resources, limiting deployment in resource‑constrained clinical settings. We present SegMate, an efficient 2.5D framework that achieves state‑of‑the‑art accuracy, while considerably reducing computational requirements. Our efficient design is the result of meticulously integrating asymmetric architectures, attention mechanisms, multi‑scale feature fusion, slice‑based positional conditioning, and multi‑task optimization. We demonstrate the efficiency‑accuracy trade‑off of our framework across three modern backbones (EfficientNetV2‑M, MambaOut‑Tiny, FastViT‑T12). We perform experiments on three datasets: TotalSegmentator, SegTHOR and AMOS22. Compared with the vanilla models, SegMate reduces computation (GFLOPs) by up to 2.5x and memory footprint (VRAM) by up to 2.1x, while generally registering performance gains of around 1%. On TotalSegmentator, we achieve a Dice score of 93.51% with only 295MB peak GPU memory. Zero‑shot cross‑dataset evaluations on SegTHOR and AMOS22 demonstrate strong generalization, with Dice scores of up to 86.85% and 89.35%, respectively. We release our open‑source code at https://github.com/andreibunea99/SegMate.

Authors:Ning Gao, Xiuhui Zhang, Xingyu Jiang, Mukang You, Mohan Zhang, Yue Deng
Title: RF-Agent: Automated Reward Function Design via Language Agent Tree Search
Abstract:
Designing efficient reward functions for low‑level control tasks is a challenging problem. Recent research aims to reduce reliance on expert experience by using Large Language Models (LLMs) with task information to generate dense reward functions. These methods typically rely on training results as feedback, iteratively generating new reward functions with greedy or evolutionary algorithms. However, they suffer from poor utilization of historical feedback and inefficient search, resulting in limited improvements in complex control tasks. To address this challenge, we propose RF‑Agent, a framework that treats LLMs as language agents and frames reward function design as a sequential decision‑making process, enhancing optimization through better contextual reasoning. RF‑Agent integrates Monte Carlo Tree Search (MCTS) to manage the reward design and optimization process, leveraging the multi‑stage contextual reasoning ability of LLMs. This approach better utilizes historical information and improves search efficiency to identify promising reward functions. Outstanding experimental results in 17 diverse low‑level control tasks demonstrate the effectiveness of our method. The source code is available at https://github.com/deng‑ai‑lab/RF‑Agent.

Authors:Zhaowen Wang, Dongdong Zhou, Qi Xu, Fengyu Cong, Mohammad Al-Sa'd, Jenni Raitoharju
Title: ULW-SleepNet: An Ultra-Lightweight Network for Multimodal Sleep Stage Scoring
Abstract:
Automatic sleep stage scoring is crucial for the diagnosis and treatment of sleep disorders. Although deep learning models have advanced the field, many existing models are computationally demanding and designed for single‑channel electroencephalography (EEG), limiting their practicality for multimodal polysomnography (PSG) data. To overcome this, we propose ULW‑SleepNet, an ultra‑lightweight multimodal sleep stage scoring framework that efficiently integrates information from multiple physiological signals. ULW‑SleepNet incorporates a novel Dual‑Stream Separable Convolution (DSSC) Block, depthwise separable convolutions, channel‑wise parameter sharing, and global average pooling to reduce computational overhead while maintaining competitive accuracy. Evaluated on the Sleep‑EDF‑20 and Sleep‑EDF‑78 datasets, ULW‑SleepNet achieves accuracies of 86.9% and 81.4%, respectively, with only 13.3K parameters and 7.89M FLOPs. Compared to state‑of‑the‑art methods, our model reduces parameters by up to 98.6% with only marginal performance loss, demonstrating its strong potential for real‑time sleep monitoring on wearable and IoT devices. The source code for this study is publicly available at https://github.com/wzw999/ULW‑SLEEPNET.

Authors:Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, Yuanyuan Liu
Title: FedNSAM:Consistency of Local and Global Flatness for Federated Learning
Abstract:
In federated learning (FL), multi‑step local updates and data heterogeneity usually lead to sharper global minima, which degrades the performance of the global model. Popular FL algorithms integrate sharpness‑aware minimization (SAM) into local training to address this issue. However, in the high data heterogeneity setting, the flatness in local training does not imply the flatness of the global model. Therefore, minimizing the sharpness of the local loss surfaces on the client data does not enable the effectiveness of SAM in FL to improve the generalization ability of the global model. We define the flatness distance to explain this phenomenon. By rethinking the SAM in FL and theoretically analyzing the flatness distance, we propose a novel FedNSAM algorithm that accelerates the SAM algorithm by introducing global Nesterov momentum into the local update to harmonize the consistency of global and local flatness. FedNSAM uses the global Nesterov momentum as the direction of local estimation of client global perturbations and extrapolation. Theoretically, we prove a tighter convergence bound than FedSAM by Nesterov extrapolation. Empirically, we conduct comprehensive experiments on CNN and Transformer models to verify the superior performance and efficiency of FedNSAM. The code is available at https://github.com/junkangLiu0/FedNSAM.

Authors:Tiantong Wang, Xinyu Yan, Tiantong Wu, Yurong Hao, Yong Jiang, Fei Huang, Wei Yang Bryan Lim
Title: MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models
Abstract:
Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server's parameters or the client's forget set. To address this dual non‑disclosure constraint, we propose MPU, an algorithm‑agnostic privacy‑preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server‑side modules: Pre‑Process for randomized copy generation and Post‑Process for update aggregation. In Pre‑Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server's exact original parameters. After local unlearning, the server performs Post‑Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise‑free baselines, with most algorithms' average degradation well below 1% under 10% noise, and can even outperform the noise‑free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan‑SHU/MPU.

Authors:Dingqi Ye, Daniel Kiv, Wei Hu, Jimeng Shi, Shaowen Wang
Title: Any Model, Any Place, Any Time: Get Remote Sensing Foundation Model Embeddings On Demand
Abstract:
The remote sensing community is witnessing a rapid growth of foundation models, which provide powerful embeddings for a wide range of downstream tasks. However, practical adoption and fair comparison remain challenging due to substantial heterogeneity in model release formats, platforms and interfaces, and input data specifications. These inconsistencies significantly increase the cost of obtaining, using, and benchmarking embeddings across models. To address this issue, we propose rs‑embed, a Python library that offers a unified, region of interst (ROI) centric interface: with a single line of code, users can retrieve embeddings from any supported model for any location and any time range. The library also provides efficient batch processing to enable large‑scale embedding generation and evaluation. The code is available at: https://github.com/cybergis/rs‑embed

Authors:Brandon Yee, Lucas Wang, Kundana Kommini, Krishna Sharma
Title: Geodesic Semantic Search: Learning Local Riemannian Metrics for Citation Graph Retrieval
Abstract:
We present Geodesic Semantic Search (GSS), a retrieval system that learns node‑specific Riemannian metrics on citation graphs to enable geometry‑aware semantic search. Unlike standard embedding‑based retrieval that relies on fixed Euclidean distances, \gss learns a low‑rank metric tensor \mL_i \in \R^d × r at each node, inducing a local positive semi‑definite metric \mG_i = \mL_i \mL_i^\top + \eps \mI. This parameterization guarantees valid metrics while keeping the model tractable. Retrieval proceeds via multi‑source Dijkstra on the learned geodesic distances, followed by Maximal Marginal Relevance reranking and path coherence filtering. On citation prediction benchmarks with 169K papers, \gss achieves 23% relative improvement in Recall@20 over SPECTER+FAISS baselines while providing interpretable citation paths. Our hierarchical coarse‑to‑fine search with k‑means pooling reduces computational cost by 4× compared to flat geodesic search while maintaining 97% retrieval quality. We provide theoretical analysis of when geodesic distances outperform direct similarity, characterize the approximation quality of low‑rank metrics, and validate predictions empirically. Code and trained models are available at https://github.com/YCRG‑Labs/geodesic‑search.

Authors:Kohei Obata, Taichi Murayama, Zheng Chen, Yasuko Matsubara, Yasushi Sakurai
Title: Disentangled Mode-Specific Representations for Tensor Time Series via Contrastive Learning
Abstract:
Multi‑mode tensor time series (TTS) can be found in many domains, such as search engines and environmental monitoring systems. Learning representations of a TTS benefits various applications, but it is also challenging since the complexities inherent in the tensor hinder the realization of rich representations. In this paper, we propose a novel representation learning method designed specifically for TTS, namely MoST. Specifically, MoST uses a tensor slicing approach to reduce the complexity of the TTS structure and learns representations that can be disentangled into individual non‑temporal modes. Each representation captures mode‑specific features, which are the relationship between variables within the same mode, and mode‑invariant features, which are in common in representations of different modes. We employ a contrastive learning framework to learn parameters; the loss function comprises two parts intended to learn representation in a mode‑specific way and mode‑invariant way, effectively exploiting disentangled representations as augmentations. Extensive experiments on real‑world datasets show that MoST consistently outperforms the state‑of‑the‑art methods in terms of classification and forecasting accuracy. Code is available at https://github.com/KoheiObata/MoST.

Authors:Kejing Yin, Haizhou Xu, Wenfang Yao, Chen Liu, Zijie Chen, Yui Haang Cheung, William K. Cheung, Jing Qin
Title: When Does Multimodal Learning Help in Healthcare? A Benchmark on EHR and Chest X-Ray Fusion
Abstract:
Machine learning holds promise for advancing clinical decision support, yet it remains unclear when multimodal learning truly helps in practice, particularly under modality missingness and fairness constraints. In this work, we conduct a systematic benchmark of multimodal fusion between Electronic Health Records (EHR) and chest X‑rays (CXR) on standardized cohorts from MIMIC‑IV and MIMIC‑CXR, aiming to answer four fundamental questions: when multimodal fusion improves clinical prediction, how different fusion strategies compare, how robust existing methods are to missing modalities, and whether multimodal models achieve algorithmic fairness. Our study reveals several key insights. Multimodal fusion improves performance when modalities are complete, with gains concentrating in diseases that require complementary information from both EHR and CXR. While cross‑modal learning mechanisms capture clinically meaningful dependencies beyond simple concatenation, the rich temporal structure of EHR introduces strong modality imbalance that architectural complexity alone cannot overcome. Under realistic missingness, multimodal benefits rapidly degrade unless models are explicitly designed to handle incomplete inputs. Moreover, multimodal fusion does not inherently improve fairness, with subgroup disparities mainly arising from unequal sensitivity across demographic groups. To support reproducible and extensible evaluation, we further release a flexible benchmarking toolkit that enables plug‑and‑play integration of new models and datasets. Together, this work provides actionable guidance on when multimodal learning helps, when it fails, and why, laying the foundation for developing clinically deployable multimodal systems that are both effective and reliable. The open‑source toolkit can be found at https://github.com/jakeykj/CareBench.

Authors:Abhishek Dalvi, Vasant Honavar
Title: Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning
Abstract:
Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine‑tuning. Such approaches depend on large‑scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross‑modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross‑modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations ‑‑ binding, bundling, and similarity‑based retrieval to construct associative cross‑modal representations in a single pass over the data. Caption generation emerges from high‑dimensional memory retrieval rather than iterative gradient‑based optimization. We show that HDFLIM achieves performance comparable to end‑to‑end vision‑language training methods and produces captions that are more semantically grounded than zero‑shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings of the respective embeddings. More broadly, this work points toward an alternative paradigm for foundation model alignment in which frozen models are integrated through structured representational mappings rather than through large‑scale retraining. The codebase for our implementation can be found at https://github.com/Abhishek‑Dalvi410/HDFLIM.

Authors:Xiang Ao
Title: SDMixer: Sparse Dual-Mixer for Time Series Forecasting
Abstract:
Multivariate time series forecasting is widely applied in fields such as transportation, energy, and finance. However, the data commonly suffers from issues of multi‑scale characteristics, weak correlations, and noise interference, which limit the predictive performance of existing models. This paper proposes a dual‑stream sparse Mixer prediction framework that extracts global trends and local dynamic features from sequences in both the frequency and time domains, respectively. It employs a sparsity mechanism to filter out invalid information, thereby enhancing the accuracy of cross‑variable dependency modeling. Experimental results demonstrate that this method achieves leading performance on multiple real‑world scenario datasets, validating its effectiveness and generality. The code is available at https://github.com/SDMixer/SDMixer

Authors:Aishwarya Sarkar, Sayan Ghosh, Nathan Tallent, Aman Chadha, Tanya Roosta, Ali Jannesari
Title: Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents
Abstract:
Large‑scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching polices. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state‑of‑the‑art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In‑Context Learning (ICL) for zero‑shot tasks, with logical multi‑step reasoning. We find this behavior well‑suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end‑to‑end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder‑llm‑agent.

Authors:Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard, Davis Blalock
Title: FlashOptim: Optimizers for Memory Efficient Training
Abstract:
Standard mixed‑precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory. We introduce FlashOptim, a suite of optimizations that reduces per‑parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8‑bit optimizer state quantization. Together with 16‑bit gradients, these techniques reduce AdamW memory from 16 bytes to 7 bytes per parameter, or 5 bytes with gradient release. They also cut model checkpoint sizes by more than half. Experiments with FlashOptim applied to SGD, AdamW, and Lion show no measurable quality degradation on any task from a collection of standard vision and language benchmarks, including Llama‑3.1‑8B finetuning.

Authors:Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari, Damon Conover, Ziran Wang, Aniket Bera
Title: Physics Informed Viscous Value Representations
Abstract:
Offline goal‑conditioned reinforcement learning (GCRL) learns goal‑conditioned policies from static pre‑collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state‑action space. Recent physics‑informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first‑order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill‑posed in complex, high‑dimensional environments. In this work, we propose a physics‑informed regularization derived from the viscosity solution of the Hamilton‑Jacobi‑Bellman (HJB) equation. By providing a physics‑based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman‑Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher‑order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high‑dimensional, complex manipulation tasks. Open‑source codes are available at https://github.com/HrishikeshVish/phys‑fk‑value‑GCRL.

Authors:Max S. Bennett, Thomas P. Zollo, Richard Zemel
Title: Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language
Abstract:
Modern machine learning models are deployed in diverse, non‑stationary environments where they must continually adapt to new tasks and evolving knowledge. Continual fine‑tuning and in‑context learning are costly and brittle, whereas neural memory methods promise lightweight updates with minimal forgetting. However, existing neural memory models typically assume a single fixed objective and homogeneous information streams, leaving users with no control over what the model remembers or ignores over time. To address this challenge, we propose a generalized neural memory system that performs flexible updates based on learning instructions specified in natural language. Our approach enables adaptive agents to learn selectively from heterogeneous information sources, supporting settings, such as healthcare and customer service, where fixed‑objective memory updates are insufficient.

Authors:Thomas Woergaard, Raghavendra Selvan
Title: FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification
Abstract:
Compressing neural networks by quantizing model parameters offers useful trade‑off between performance and efficiency. Methods like quantization‑aware training and post‑training quantization strive to maintain the downstream performance of compressed models compared to the full precision models. However, these techniques do not explicitly consider the impact on algorithmic fairness. In this work, we study fairness‑aware mixed‑precision quantization schemes for medical image classification under explicit bit budgets. We introduce FairQuant, a framework that combines group‑aware importance analysis, budgeted mixed‑precision allocation, and a learnable Bit‑Aware Quantization (BAQ) mode that jointly optimizes weights and per‑unit bit allocations under bitrate and fairness regularization. We evaluate the method on Fitzpatrick17k and ISIC2019 across ResNet18/50, DeiT‑Tiny, and TinyViT. Results show that FairQuant configurations with average precision near 4‑6 bits recover much of the Uniform 8‑bit accuracy while improving worst‑group performance relative to Uniform 4‑ and 8‑bit baselines, with comparable fairness metrics under shared budgets.

Authors:Jayadev Billa
Title: Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Abstract:
Numerous studies have shown that multimodal LLMs process speech and images well but fail in non‑intuitive ways rendering trivial tasks such as object counting unreliable. We investigate this behavior from an information‑theoretic perspective by framing multimodal LLM inference as a mismatched decoder problem: a decoder trained primarily on text can only extract information along text‑aligned directions (removing up to 98% of the variation in modality‑specific (non‑text) directions improves decoder loss) and the amount of accessible information is bounded by the Generalized Mutual Information (GMI). We show that information loss is bounded as the distributional mismatch between the source data and the text data increases, and as the sensitivity of the decoder increases. This bound is a function of the model's scoring rule not its architecture. We validate the predictions across five models spanning speech and vision. A controlled study (two Prismatic VLMs differing only in encoder text‑alignment) shows that the bottleneck lies in the scoring rule of the decoder rather than the text‑alignment of the encoder or the learned projection. A LoRA intervention demonstrates that simply training with an emotion‑related objective improves emotion detection from 17.3% to 61.8% task accuracy without affecting other attributes, confirming that the training objective determines what becomes accessible.

Authors:Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li, Conghui He, Fan Wu
Title: MoDora: Tree-Based Semi-Structured Document Analysis System
Abstract:
Semi‑structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real‑world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout‑specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM‑powered system for semi‑structured document analysis. First, we adopt a local‑alignment aggregation strategy to convert OCR‑parsed elements into layout‑aware components, and conduct type‑specific information extraction for components with hierarchical titles or non‑text elements. Second, we design the Component‑Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter‑component relations and layout distinctions through a bottom‑up cascade summarization process. Finally, we propose a question‑type‑aware retrieval strategy that supports (1) layout‑based grid partitioning for location‑based retrieval and (2) LLM‑guided pruning for semantic‑based retrieval. Experiments show MoDora outperforms baselines by 5.97%‑61.07% in accuracy. The code is at https://github.com/weAIDB/MoDora.

Authors:Camile Lendering, Erkut Akdag, Egor Bondarev
Title: SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Abstract:
Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few‑shot methods achieve strong results employing foundation‑model features, but typically rely on memory banks, auxiliary datasets, or multi‑modal tuning of vision‑language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training‑free method, that operates in two simple stages. First, patch‑level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low‑dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state‑of‑the‑art performance across one‑shot and few‑shot settings without training, prompt tuning, or memory banks. In the one‑shot anomaly detection setting, SubspaceAD achieves image‑level and pixel‑level AUROC of 98.0% and 97.6% on the MVTec‑AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state‑of‑the‑art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.

Authors:Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An
Title: Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Abstract:
Group‑based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long‑horizon agentic tasks. To enable more fine‑grained policy updates, recent research has increasingly shifted toward stepwise group‑based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy‑of‑Groups Policy Optimization (HGPO) for long‑horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with an adaptive weighting scheme. In this way, HGPO can achieve a favorable bias‑variance trade‑off in stepwise advantage estimation, without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop with Qwen2.5‑1.5B‑Instruct and Qwen2.5‑7B‑Instruct, show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints. Code is available at https://github.com/langfengQ/verl‑agent/tree/master/recipe/hgpo.

Authors:Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu
Title: QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning
Abstract:
Value decomposition (VD) methods have achieved remarkable success in cooperative multi‑agent reinforcement learning (MARL). However, their reliance on the max operator for temporal‑difference (TD) target calculation leads to systematic Q‑value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q‑learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near‑greedy joint action space. This formulation allows the target to integrate Q‑values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl‑qsim.

Authors:Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan
Title: Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
Abstract:
Vision‑language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain‑specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator‑guided reinforcement learning (RL) framework. GeoDPO employs an NL‑to‑DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine‑grained, DSL‑level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in‑domain and out‑of‑domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine‑tuning (SFT) offers only marginal improvements and may even impair performance in out‑of‑domain scenarios, GeoDPO achieves substantial gains: +26.5% on in‑domain data, +8.0% on out‑of‑domain data, and +39.0% on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All codes are released at https://github.com/Longin‑Yu/GeoPerceive to ensure reproducibility.

Authors:Shuchen Zhu, Rizhen Hu, Mingze Wang, Mou Sun, Xue Wang, Kun Yuan, Zaiwen Wen
Title: Accelerating LLM Pre-Training through Flat-Direction Dynamics Enhancement
Abstract:
Pre‑training Large Language Models requires immense computational resources, making optimizer efficiency essential. The optimization landscape is highly anisotropic, with loss reduction driven predominantly by progress along flat directions. While matrix‑based optimizers such as Muon and SOAP leverage fine‑grained curvature information to outperform AdamW, their updates tend toward isotropy ‑‑ relatively conservative along flat directions yet potentially aggressive along sharp ones. To address this limitation, we first establish a unified Riemannian Ordinary Differential Equation (ODE) framework that elucidates how common adaptive algorithms operate synergistically: the preconditioner induces a Riemannian geometry that mitigates ill‑conditioning, while momentum serves as a Riemannian damping term that promotes convergence. Guided by these insights, we propose LITE, a generalized acceleration strategy that enhances training dynamics by applying larger Hessian damping coefficients and learning rates along flat trajectories. Extensive experiments demonstrate that LITE significantly accelerates both Muon and SOAP across diverse architectures (Dense, MoE), parameter scales (130M‑‑1.3B), datasets (C4, Pile), and learning‑rate schedules (cosine, warmup‑stable‑decay). Theoretical analysis confirms that LITE facilitates faster convergence along flat directions in anisotropic landscapes, providing a principled approach to efficient LLM pre‑training. The code is available at https://github.com/SHUCHENZHU/LITE.

Authors:Md Tanvir Hasan Turja
Title: Forecasting Antimicrobial Resistance Trends Using Machine Learning on WHO GLASS Surveillance Data: A Retrieval-Augmented Generation Approach for Policy Decision Support
Abstract:
Antimicrobial resistance (AMR) is a growing global crisis projected to cause 10 million deaths per year by 2050. While the WHO Global Antimicrobial Resistance and Use Surveillance System (GLASS) provides standardized surveillance data across 44 countries, few studies have applied machine learning to forecast population‑level resistance trends from this data. This paper presents a two‑component framework for AMR trend forecasting and evidence‑grounded policy decision support. We benchmark six models ‑‑ Naive, Linear Regression, Ridge Regression, XGBoost, LightGBM, and LSTM ‑‑ on 5,909 WHO GLASS observations across six WHO regions (2021‑2023). XGBoost achieved the best performance with a test MAE of 7.07% and R‑squared of 0.854, outperforming the naive baseline by 83.1%. Feature importance analysis identified the prior‑year resistance rate as the dominant predictor (50.5% importance), while regional MAE ranged from 4.16% (European Region) to 10.14% (South‑East Asia Region). We additionally implemented a Retrieval‑Augmented Generation (RAG) pipeline combining a ChromaDB vector store of WHO policy documents with a locally deployed Phi‑3 Mini language model, producing source‑attributed, hallucination‑constrained policy answers. Code and data are available at https://github.com/TanvirTurja

Authors:Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song
Title: dLLM: Simple Diffusion Language Modeling
Abstract:
Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad‑hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open‑source framework that unifies the core components of diffusion language modeling ‑‑ training, inference, and evaluation ‑‑ and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open‑source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT‑style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.

Authors:Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han
Title: Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Abstract:
Generative retrieval has emerged as a powerful paradigm for LLM‑based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix‑Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high‑throughput LLM‑based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large‑scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47‑1033x speedup over a hardware‑accelerated binary‑search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production‑scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold‑start performance for generative retrieval. Our code is available at https://github.com/youtube/static‑constraint‑decoding.

Authors:Hai Huang, Yann LeCun, Randall Balestriero
Title: Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA
Abstract:
Large Language Models (LLMs) obey consistent scaling laws ‑‑ empirical power‑law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data‑efficiency bounds implied by these laws ‑‑ which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA‑style regularizer that confines hidden‑state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi‑view augmentations. We show this constraint improves signal‑to‑noise ratio, and consequently preserves diversity by preventing trajectory collisions during inference. Empirically, STP allows LLMs to match baseline accuracy with 16× less training data on the NL‑RX‑SYNTH dataset, directly violating the data term of Chinchilla‑style scaling laws and demonstrating that principled geometric priors can surpass brute‑force scaling. Code is available at https://github.com/galilai‑group/llm‑jepa#stp.

Authors:Jessie Yuan, Yilin Wu, Andrea Bajcsy
Title: When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering
Abstract:
Policy steering is an emerging way to adapt robot behaviors at deployment‑time: a learned verifier analyzes low‑level action samples proposed by a pre‑trained policy (e.g., diffusion policy) and selects only those aligned with the task. While Vision‑Language Models (VLMs) are promising general‑purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well‑calibrated. In practice, the overconfident judgment from VLM can degrade the steering performance under both high‑level semantic uncertainty in task specifications and low‑level action uncertainty or incapability of the pre‑trained policy. We propose uncertainty‑aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low‑level action feasibility, and selects an uncertainty resolution strategy: execute a high‑confidence action, clarify task ambiguity via natural language queries, or ask for action interventions to correct the low‑level policy when it is deemed incapable at the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre‑trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre‑trained policy, enabling the system to learn continually but with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimizes expensive user interventions compared to uncalibrated baselines and prior human‑ or robot‑gated continual learning approaches. Videos can be found at https://jessie‑yuan.github.io/ups/

Authors:Emilio Ferrara
Title: ECHO: Encoding Communities via High-order Operators
Abstract:
Community detection in attributed networks faces a fundamental divide: topological algorithms ignore semantic features, while Graph Neural Networks (GNNs) encounter devastating computational bottlenecks. Specifically, GNNs suffer from a Semantic Wall of feature over smoothing in dense or heterophilic networks, and a Systems Wall driven by the O(N^2) memory constraints of pairwise clustering. To dismantle these barriers, we introduce ECHO (Encoding Communities via High order Operators), a scalable, self supervised architecture that reframes community detection as an adaptive, multi scale diffusion process. ECHO features a Topology Aware Router that automatically analyzes structural heuristics sparsity, density, and assortativity to route graphs through the optimal inductive bias, preventing heterophilic poisoning while ensuring semantic densification. Coupled with a memory sharded full batch contrastive objective and a novel chunked O(N \cdot K) similarity extraction method, ECHO completely bypasses traditional O(N^2) memory bottlenecks without sacrificing the mathematical precision of global gradients. Extensive evaluations demonstrate that this topology feature synergy consistently overcomes the classical resolution limit. On synthetic LFR benchmarks scaled up to 1 million nodes, ECHO achieves scale invariant accuracy despite severe topological noise. Furthermore, on massive real world social networks with over 1.6 million nodes and 30 million edges, it completes clustering in mere minutes with throughputs exceeding 2,800 nodes per second matching the speed of highly optimized purely topological baselines. The implementation utilizes a unified framework that automatically engages memory sharded optimization to support adoption across varying hardware constraints. GitHub Repository: https://github.com/emilioferrara/ECHO‑GNN

Authors:Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, Huaxiu Yao
Title: SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
Abstract:
Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely ``read'' text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized‑Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5‑VL reveal a startling capability‑utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep‑seated ``modality laziness.'' To bridge this gap, we propose SimpleOCR, a plug‑and‑play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text‑based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL‑based methods. Furthermore, its plug‑and‑play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming‑lab/SimpleOCR.

Authors:Dawei Su, Dongsheng Wang
Title: RETLLM: Training and Data-Free MLLMs for Multimodal Information Retrieval
Abstract:
Multimodal information retrieval (MMIR) has gained attention for its flexibility in handling text, images, or mixed queries and candidates. Recent breakthroughs in multimodal large language models (MLLMs) boost MMIR performance by incorporating MLLM knowledge under the contrastive finetuning framework. However, they suffer from pre‑training inconsistency and require large datasets. In this work, we introduce a novel framework, RetLLM, designed to query MLLMs for MMIR in a training‑ and data‑free manner. Specifically, we formulate MMIR as a similarity score generation task and prompt MLLMs to directly predict retrieval scores in a coarse‑then‑fine pipeline. At the coarse stage, a top‑k filtering strategy builds a small yet high‑quality candidate pool for each query, enabling MLLMs to focus on semantically relevant candidates. Subsequently, the retrieval score is predicted by feeding both the query and candidate into MLLMs at the fine stage. Importantly, we propose a visual enhancement module during reasoning to help MLLMs re‑pick forgotten visuals, improving retrieval. Extensive experiments on MMIR benchmarks show that RetLLM outperforms fine‑tuned models. Ablation studies further verify each component. Our work demonstrates that MLLMs can achieve strong MMIR performance without any training, highlighting their inherent multimodal reasoning ability in a simple, scalable framework. We release our code at: https://github.com/alivecat05/RETLLM

Authors:Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney
Title: Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials
Abstract:
General‑purpose 3D chemical modeling encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom‑1, the first end‑to‑end, fully open‑source foundation model that unifies generative and predictive learning of 3D molecules and materials. Zatom‑1 is a Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use joint generative pretraining as a universal initialization for downstream multi‑task prediction of properties, energies, and forces. Empirically, Zatom‑1 matches or outperforms specialized baselines on both generative and predictive benchmarks, while reducing the generative inference time by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between chemical domains from joint generative pretraining: modeling materials during pretraining improves molecular property prediction accuracy.

Authors:Ida Egendal, Rasmus Froberg Brøndum, Dan J Woodcock, Christopher Yau, Martin Bøgsted
Title: VAE-MS: An Asymmetric Variational Autoencoder for Mutational Signature Extraction
Abstract:
Mutational signature analysis has emerged as a powerful method for uncovering the underlying biological processes driving cancer development. However, the signature extraction process, typically performed using non‑negative matrix factorization (NMF), often lacks reliability and clinical applicability. To address these limitations, several solutions have been introduced, including the use of neural networks to achieve more accurate estimates and probabilistic methods to better capture natural variation in the data. In this work, we introduce a Variational Autoencoder for Mutational Signatures (VAE‑MS), a novel model that leverages both an asymmetric architecture and probabilistic methods for the extraction of mutational signatures. VAE‑MS is compared to with three state‑of‑the‑art models for mutational signature extraction: SigProfilerExtractor, the NMF‑based gold standard; MUSE‑XAE, an autoencoder that employs an asymmetric design without probabilistic components; and SigneR, a Bayesian NMF model, to illustrate the strength in combining a nonlinear extraction with a probabilistic model. In the ability to reconstruct input data and generalize to unseen data, models with probabilistic components (VAE‑MS, SigneR) dramatically outperformed models without (SigProfilerExtractor, MUSE‑XAE). The NMF‑baed models (SigneR, SigProfilerExtractor) had the most accurate reconstructions in simulated data, while VAE‑MS reconstructed more accurately on real cancer data. Upon evaluating the ability to extract signatures consistently, no model exhibited a clear advantage over the others. Software for VAE‑MS is available at https://github.com/CLINDA‑AAU/VAE‑MS.

Authors:Pantia-Marina Alchirch, Dimitrios I. Diochnos
Title: On Imbalanced Regression with Hoeffding Trees
Abstract:
Many real‑world applications generate continuous data streams for regression. Hoeffding trees and their variants have a long‑standing tradition due to their effectiveness, either alone or as base models in broader ensembles. Recent batch‑learning work shows that kernel density estimation (KDE) improves smoothed predictions in imbalanced regression [Yang et al., 2021], while hierarchical shrinkage (HS) provides post‑hoc regularization for decision trees without modifying their structure [Agarwal et al., 2022]. We extend KDE to streaming settings via a telescoping formulation and integrate HS into incremental decision trees. Empirical evaluation on standard online regression benchmarks shows that KDE consistently improves early‑stream performance, whereas HS provides limited gains. Our implementation is publicly available at: https://github.com/marinaAlchirch/DSFA_2026.

Authors:Jinpeng Li, Zhongyi Pei, Huaze Xue, Bojian Zheng, Chen Wang, Jianmin Wang
Title: DualWeaver: Synergistic Feature Weaving Surrogates for Multivariate Forecasting with Univariate Time Series Foundation Models
Abstract:
Time‑series foundation models (TSFMs) have achieved strong univariate forecasting through large‑scale pre‑training, yet effectively extending this success to multivariate forecasting remains challenging. To address this, we propose DualWeaver, a novel framework that adapts univariate TSFMs (Uni‑TSFMs) for multivariate forecasting by using a pair of learnable, structurally symmetric surrogate series. Generated by a shared auxiliary feature‑fusion module that captures cross‑variable dependencies, these surrogates are mapped to TSFM‑compatible series via the forecasting objective. The symmetric structure enables parameter‑free reconstruction of final predictions directly from the surrogates, without additional parametric decoding. A theoretically grounded regularization term is further introduced to enhance robustness against adaptation collapse. Extensive experiments on diverse real‑world datasets show that DualWeaver outperforms state‑of‑the‑art multivariate forecasters in both accuracy and stability. We release the code at https://github.com/li‑jinpeng/DualWeaver.

Authors:Sterre de Jonge, Elisabeth J. Vinke, Meike W. Vernooij, Daniel C. Alexander, Alexandra L. Young, Esther E. Bron
Title: Disease Progression and Subtype Modeling for Combined Discrete and Continuous Input Data
Abstract:
Disease progression modeling provides a robust framework to identify long‑term disease trajectories from short‑term biomarker data. It is a valuable tool to gain a deeper understanding of diseases with a long disease trajectory, such as Alzheimer's disease. A key limitation of most disease progression models is that they are specific to a single data type (e.g., continuous data), thereby limiting their applicability to heterogeneous, real‑world datasets. To address this limitation, we propose the Mixed Events model, a novel disease progression model that handles both discrete and continuous data types. This model is implemented within the Subtype and Stage Inference (SuStaIn) framework, resulting in Mixed‑SuStaIn, enabling subtype and progression modeling. We demonstrate the effectiveness of Mixed‑SuStaIn through simulation experiments and real‑world data from the Alzheimer's Disease Neuroimaging Initiative, showing that it performs well on mixed datasets. The code is available at: https://github.com/ucl‑pond/pySuStaIn.

Authors:Cuong Anh Pham, Praneeth Vepakomma, Samuel Horváth
Title: Learning in the Null Space: Small Singular Values for Continual Learning
Abstract:
Alleviating catastrophic forgetting while enabling further learning is a primary challenge in continual learning (CL). Orthogonal‑based training methods have gained attention for their efficiency and strong theoretical properties, and many existing approaches enforce orthogonality through gradient projection. In this paper, we revisit orthogonality and exploit the fact that small singular values correspond to directions that are nearly orthogonal to the input space of previous tasks. Building on this principle, we introduce NESS (Null‑space Estimated from Small Singular values), a CL method that applies orthogonality directly in the weight space rather than through gradient manipulation. Specifically, NESS constructs an approximate null space using the smallest singular values of each layer's input representation and parameterizes task‑specific updates via a compact low‑rank adaptation (LoRA‑style) formulation constrained to this subspace. The subspace basis is fixed to preserve the null‑space constraint, and only a single trainable matrix is learned for each task. This design ensures that the resulting updates remain approximately in the null space of previous inputs while enabling adaptation to new tasks. Our theoretical analysis and experiments on three benchmark datasets demonstrate competitive performance, low forgetting, and stable accuracy across tasks, highlighting the role of small singular values in continual learning. The code is available at https://github.com/pacman‑ctm/NESS.

Authors:Lin Zhu, Lei You
Title: xai-cola: A Python library for sparsifying counterfactual explanations
Abstract:
Counterfactual explanation (CE) is an important domain within post‑hoc explainability. However, the explanations generated by most CE generators are often highly redundant. This work introduces an open‑source Python library xai‑cola, which provides an end‑to‑end pipeline for sparsifying CEs produced by arbitrary generators, reducing superfluous feature changes while preserving their validity. It offers a documented API that takes as input raw tabular data in pandas DataFrame form, a preprocessing object (for standardization and encoding), and a trained scikit‑learn or PyTorch model. On this basis, users can either employ the built‑in or externally imported CE generators. The library also implements several sparsification policies and includes visualization routines for analysing and comparing sparsified counterfactuals. xai‑cola is released under the MIT license and can be installed from PyPI. Empirical experiments indicate that xai‑cola produces sparser counterfactuals across several CE generators, reducing the number of modified features by up to 50% in our setting. The source code is available at https://github.com/understanding‑ml/COLA.

Authors:Xiannan Huang, Quan Yuan, Chao Yang
Title: Learning from Yesterday's Error: An Efficient Online Learning Method for Traffic Demand Prediction
Abstract:
Accurately predicting short‑term traffic demand is critical for intelligent transportation systems. While deep learning models achieve strong performance under stationary conditions, their accuracy often degrades significantly when faced with distribution shifts caused by external events or evolving urban dynamics. Frequent model retraining to adapt to such changes incurs prohibitive computational costs, especially for large‑scale or foundation models. To address this challenge, we propose FORESEE (Forecasting Online with Residual Smoothing and Ensemble Experts), a lightweight online adaptation framework that is accurate, robust, and computationally efficient. FORESEE operates without any parameter updates to the base model. Instead, it corrects today's forecast in each region using yesterday's prediction error, stabilized through exponential smoothing guided by a mixture‑of‑experts mechanism that adapts to recent error dynamics. Moreover, an adaptive spatiotemporal smoothing component propagates error signals across neighboring regions and time slots, capturing coherent shifts in demand patterns. Extensive experiments on seven real‑world datasets with three backbone models demonstrate that FORESEE consistently improves prediction accuracy, maintains robustness even when distribution shifts are minimal (avoiding performance degradation), and achieves the lowest computational overhead among existing online methods. By enabling real‑time adaptation of traffic forecasting models with negligible computational cost, FORESEE paves the way for deploying reliable, up‑to‑date prediction systems in dynamic urban environments. Code and data are available at https://github.com/xiannanhuang/FORESEE

Authors:Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang
Title: Muon+: Towards Better Muon via One Additional Normalization Step
Abstract:
The Muon optimizer has demonstrated promising performance in pre‑training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre‑training experiments across a wide range of model scales and architectures. Our evaluation includes GPT‑style models ranging from 130M to 774M parameters and LLaMA‑style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute‑optimal training regime and further extend the token‑to‑parameter (T2P) ratio to an industrial level of \approx 200. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: https://github.com/K1seki221/MuonPlus.

Authors:Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius, Daniel Szafir
Title: LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Abstract:
General‑purpose robots must master long‑horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision‑Language‑Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo‑VLA (Linked Local VLA), a modular framework capable of zero‑shot generalization to novel long‑horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object‑centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end‑to‑end approaches. We introduce a 21‑task simulation benchmark consisting of two challenging suites: LIBERO‑Long++ and Ultra‑Long. In these simulations, LiLo‑VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA‑OFT by 67%. Furthermore, real‑world evaluations across 8 long‑horizon tasks demonstrate an average success rate of 85%. Project page: https://yy‑gx.github.io/LiLo‑VLA/.

Authors:Boyuan Li, Zhen Liu, Yicheng Luo, Qianli Ma
Title: Learning Recursive Multi-Scale Representations for Irregular Multivariate Time Series Forecasting
Abstract:
Irregular Multivariate Time Series (IMTS) are characterized by uneven intervals between consecutive timestamps, which carry sampling pattern information valuable and informative for learning temporal and variable dependencies. In addition, IMTS often exhibit diverse dependencies across multiple time scales. However, many existing multi‑scale IMTS methods use resampling to obtain the coarse series, which can alter the original timestamps and disrupt the sampling pattern information. To address the challenge, we propose ReIMTS, a Recursive multi‑scale modeling approach for Irregular Multivariate Time Series forecasting. Instead of resampling, ReIMTS keeps timestamps unchanged and recursively splits each sample into subsamples with progressively shorter time periods. Based on the original sampling timestamps in these long‑to‑short subsamples, an irregularity‑aware representation fusion mechanism is proposed to capture global‑to‑local dependencies for accurate forecasting. Extensive experiments demonstrate an average performance improvement of 27.1% in the forecasting task across different models and real‑world datasets. Our code is available at https://github.com/Ladbaby/PyOmniTS.

Authors:Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang
Title: GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Abstract:
Reinforcement learning (RL) has become a central post‑training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non‑stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine‑tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low‑utility problems. We propose GradAlign, a gradient‑aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low‑utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient signals in navigating non‑stationary policy optimization and yielding more stable training and improved final performance. We release our implementation at https://github.com/StigLidu/GradAlign

Authors:Jesse He, Helen Jenne, Max Vargas, Davis Brown, Gal Mishne, Yusu Wang, Henry Kvinge
Title: MINAR: Mechanistic Interpretability for Neural Algorithmic Reasoning
Abstract:
The recent field of neural algorithmic reasoning (NAR) studies the ability of graph neural networks (GNNs) to emulate classical algorithms like Bellman‑Ford, a phenomenon known as algorithmic alignment. At the same time, recent advances in large language models (LLMs) have spawned the study of mechanistic interpretability, which aims to identify granular model components like circuits that perform specific computations. In this work, we introduce Mechanistic Interpretability for Neural Algorithmic Reasoning (MINAR), an efficient circuit discovery toolbox that adapts attribution patching methods from mechanistic interpretability to the GNN setting. We show through two case studies that MINAR recovers faithful neuron‑level circuits from GNNs trained on algorithmic tasks. Our study sheds new light on the process of circuit formation and pruning during training, as well as giving new insight into how GNNs trained to perform multiple tasks in parallel reuse circuit components for related tasks. Our code is available at https://github.com/pnnl/MINAR.

Authors:Jan Pauls, Karsten Schrödter, Sven Ligensa, Martin Schwartz, Berkant Turan, Max Zimmer, Sassan Saatchi, Sebastian Pokutta, Philippe Ciais, Fabian Gieseke
Title: ECHOSAT: Estimating Canopy Height Over Space And Time
Abstract:
Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT, a global and temporally consistent tree height map at 10 m resolution spanning multiple years. To this end, we resort to multi‑sensor satellite data to train a specialized vision transformer model, which performs pixel‑level temporal regression. A self‑supervised growth loss regularizes the predictions to follow growth curves that are in line with natural tree development, including gradual height increases over time, but also abrupt declines due to forest loss events such as fires. Our experimental evaluation shows that our model improves state‑of‑the‑art accuracies in the context of single‑year predictions. We also provide the first global‑scale height map that accurately quantifies tree growth and disturbances over time. We expect ECHOSAT to advance global efforts in carbon monitoring and disturbance assessment. The maps can be accessed at https://github.com/ai4forest/echosat.

Authors:Alina Devkota, Jacob Thrasher, Donald Adjeroh, Binod Bhattarai, Prashnna K. Gyawali
Title: FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning
Abstract:
Federated Learning (FL) enables collaborative model training across multiple clients without sharing their private data. However, data heterogeneity across clients leads to client drift, which degrades the overall generalization performance of the model. This effect is further compounded by overemphasis on poorly performing clients. To address this problem, we propose FedVG, a novel gradient‑based federated aggregation framework that leverages a global validation set to guide the optimization process. Such a global validation set can be established using readily available public datasets, ensuring accessibility and consistency across clients without compromising privacy. In contrast to conventional approaches that prioritize client dataset volume, FedVG assesses the generalization ability of client models by measuring the magnitude of validation gradients across layers. Specifically, we compute layerwise gradient norms to derive a client‑specific score that reflects how much each client needs to adjust for improved generalization on the global validation set, thereby enabling more informed and adaptive federated aggregation. Extensive experiments on both natural and medical image benchmarking datasets, across diverse model architectures, demonstrate that FedVG consistently improves performance, particularly in highly heterogeneous settings. Moreover, FedVG is modular and can be seamlessly integrated with various state‑of‑the‑art FL algorithms, often further improving their results. Our code is available at https://github.com/alinadevkota/FedVG.

Authors:Subhadip Mitra
Title: Field-Theoretic Memory for AI Agents: Continuous Dynamics for Context Preservation
Abstract:
We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi‑agent scenarios. We evaluate the system on two established long‑context benchmarks: LoCoMo (ACL 2024) with 300‑turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi‑session reasoning over 500+ turns. On LongMemEval, the field‑theoretic approach achieves significant improvements: +116% F1 on multi‑session reasoning (p<0.01, d= 3.06), +43.8% on temporal reasoning (p<0.001, d= 9.21), and +27.8% retrieval recall on knowledge updates (p<0.001, d= 5.00). Multi‑agent experiments show near‑perfect collective intelligence (>99.8%) through field coupling. Code is available at github.com/rotalabs/rotalabs‑fieldmem.

Authors:Abdulaziz Almuzairee, Henrik I. Christensen
Title: Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics
Abstract:
Visual reinforcement learning is appealing for robotics but expensive ‑‑ off‑policy methods are sample‑efficient yet slow; on‑policy methods parallelize well but waste samples. Recent work has shown that off‑policy methods can train faster than on‑policy methods in wall‑clock time for state‑based control. Extending this to vision remains challenging, where high‑dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall‑clock training than prior visual off‑policy and on‑policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update‑to‑data ratio, and an optimized implementation. We evaluate on the SO‑101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim‑to‑real transfer to a real SO‑101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.

Authors:Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong
Title: Aletheia tackles FirstProof autonomously
Abstract:
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at https://github.com/google‑deepmind/superhuman/tree/main/aletheia.

Authors:Duowen Chen, Yan Wang
Title: ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning
Abstract:
Federated Semi‑Supervised Learning (FSSL) aims to collaboratively train a global model across clients by leveraging partially‑annotated local data in a privacy‑preserving manner. In FSSL, data heterogeneity is a challenging issue, which exists both across clients and within clients. External heterogeneity refers to the data distribution discrepancy across different clients, while internal heterogeneity represents the mismatch between labeled and unlabeled data within clients. Most FSSL methods typically design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (external) and / or filter out low‑confidence unlabeled samples to reduce mistakes in local client (internal). But, the former is hard to precisely fit the ideal global distribution via direct weights, and the latter results in fewer data participation into FL training. To this end, we propose a proxy‑guided framework called ProxyFL that focuses on simultaneously mitigating external and internal heterogeneity via a unified proxy. I.e., we consider the learnable weights of classifier as proxy to simulate the category distribution both locally and globally. For external, we explicitly optimize global proxy against outliers instead of direct weights; for internal, we re‑include the discarded samples into training by a positive‑negative proxy pool to mitigate the impact of potentially‑incorrect pseudo‑labels. Insight experiments & theoretical analysis show our significant performance and convergence in FSSL.

Authors:Dongik Park, Hyunwoo Ryu, Suahn Bae, Keondo Park, Hyung-Sin Kim
Title: T1: One-to-One Channel-Head Binding for Multivariate Time-Series Imputation
Abstract:
Imputing missing values in multivariate time series remains challenging, especially under diverse missing patterns and heavy missingness. Existing methods suffer from suboptimal performance as corrupted temporal features hinder effective cross‑variable information transfer, amplifying reconstruction errors. Robust imputation requires both extracting temporal patterns from sparse observations within each variable and selectively transferring information across variables‑‑yet current approaches excel at one while compromising the other. We introduce T1 (Time series imputation with 1‑to‑1 channel‑head binding), a CNN‑Transformer hybrid architecture that achieves robust imputation through Channel‑Head Binding‑‑a mechanism creating one‑to‑one correspondence between CNN channels and attention heads. This design enables selective information transfer: when missingness corrupts certain temporal patterns, their corresponding attention pathways adaptively down‑weight based on remaining observable patterns while preserving reliable cross‑variable connections through unaffected channels. Experiments on 11 benchmark datasets demonstrate that T1 achieves state‑of‑the‑art performance, reducing MSE by 46% on average compared to the second‑best baseline, with particularly strong gains under extreme sparsity (70% missing ratio). The model generalizes to unseen missing patterns without retraining and uses a consistent hyperparameter configuration across all datasets. The code is available at https://github.com/Oppenheimerdinger/T1.

Authors:Tianhao Fu, Yucheng Chen
Title: MIP Candy: A Modular PyTorch Framework for Medical Image Processing
Abstract:
Medical image processing demands specialized software that handles high‑dimensional volumetric data, heterogeneous file formats, and domain‑specific training procedures. Existing frameworks either provide low‑level components that require substantial integration effort or impose rigid, monolithic pipelines that resist modification. We present MIP Candy (MIPCandy), a freely available, PyTorch‑based framework designed specifically for medical image processing. MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, \textttbuild_network, while retaining fine‑grained control over every component. Central to the design is \textttLayerT, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules without subclassing. The framework further offers built‑in k‑fold cross‑validation, dataset inspection with automatic region‑of‑interest detection, deep supervision, exponential moving average, multi‑frontend experiment tracking (Weights & Biases, Notion, MLflow), training state recovery, and validation score prediction via quotient regression. An extensible bundle ecosystem provides pre‑built model implementations that follow a consistent trainer‑‑predictor pattern and integrate with the core framework without modification. MIPCandy is open‑source under the Apache‑2.0 license and requires Python~3.12 or later. Source code and documentation are available at https://github.com/ProjectNeura/MIPCandy.

Authors:Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song
Title: SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Abstract:
Vision‑Language Models (VLMs) have been increasingly applied in real‑world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real‑world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi‑step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real‑world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task‑relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.

Authors:Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro
Title: Communication-Inspired Tokenization for Structured Image Representations
Abstract:
Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer‑based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object‑level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow‑matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end‑to‑end using a combination of flow‑matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object‑centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.

Authors:Jiawei Wang, Chuang Yang, Jiawei Yong, Xiaohang Xu, Hongjun Wang, Noboru Koshizuka, Shintaro Fukushima, Ryosuke Shibasaki, Renhe Jiang
Title: TrajGPT-R: Generating Urban Mobility Trajectory with Reinforcement Learning-Enhanced Generative Pre-trained Transformer
Abstract:
Mobility trajectories are essential for understanding urban dynamics and enhancing urban planning, yet access to such data is frequently hindered by privacy concerns. This research introduces a transformative framework for generating large‑scale urban mobility trajectories, employing a novel application of a transformer‑based model pre‑trained and fine‑tuned through a two‑phase process. Initially, trajectory generation is conceptualized as an offline reinforcement learning (RL) problem, with a significant reduction in vocabulary space achieved during tokenization. The integration of Inverse Reinforcement Learning (IRL) allows for the capture of trajectory‑wise reward signals, leveraging historical data to infer individual mobility preferences. Subsequently, the pre‑trained model is fine‑tuned using the constructed reward model, effectively addressing the challenges inherent in traditional RL‑based autoregressive methods, such as long‑term credit assignment and handling of sparse reward environments. Comprehensive evaluations on multiple datasets illustrate that our framework markedly surpasses existing models in terms of reliability and diversity. Our findings not only advance the field of urban mobility modeling but also provide a robust methodology for simulating urban data, with significant implications for traffic management and urban development planning. The implementation is publicly available at https://github.com/Wangjw6/TrajGPT_R.

Authors:Mohammed Rakib, Luke Vaughan, Shivang Patel, Flera Rizatdinova, Alexander Khanov, Atriya Sen
Title: PhyGHT: Physics-Guided HyperGraph Transformer for Signal Purification at the HL-LHC
Abstract:
The High‑Luminosity Large Hadron Collider (HL‑LHC) at CERN will produce unprecedented datasets capable of revealing fundamental properties of the universe. However, realizing its discovery potential faces a significant challenge: extracting small signal fractions from overwhelming backgrounds dominated by approximately 200 simultaneous pileup collisions. This extreme noise severely distorts the physical observables required for accurate reconstruction. To address this, we introduce the Physics‑Guided Hypergraph Transformer (PhyGHT), a hybrid architecture that combines distance‑aware local graph attention with global self‑attention to mirror the physical topology of particle showers formed in proton‑proton collisions. Crucially, we integrate a Pileup Suppression Gate (PSG), an interpretable, physics‑constrained mechanism that explicitly learns to filter soft noise prior to hypergraph aggregation. To validate our approach, we release a novel simulated dataset of top‑quark pair production to model extreme pileup conditions. PhyGHT outperforms state‑of‑the‑art baselines from the ATLAS and CMS experiments in predicting the signal's energy and mass correction factors. By accurately reconstructing the top quark's invariant mass, we demonstrate how machine learning innovation and interdisciplinary collaboration can directly advance scientific discovery at the frontiers of experimental physics and enhance the HL‑LHC's discovery potential. The dataset and code are available at https://github.com/rAIson‑Lab/PhyGHT

Authors:Bolin Shen, Zhan Cheng, Neil Zhenqiang Gong, Fan Yao, Yushun Dong
Title: CREDIT: Certified Ownership Verification of Deep Neural Networks Against Model Extraction Attacks
Abstract:
Machine Learning as a Service (MLaaS) has emerged as a widely adopted paradigm for providing access to deep neural network (DNN) models, enabling users to conveniently leverage these models through standardized APIs. However, such services are highly vulnerable to Model Extraction Attacks (MEAs), where an adversary repeatedly queries a target model to collect input‑output pairs and uses them to train a surrogate model that closely replicates its functionality. While numerous defense strategies have been proposed, verifying the ownership of a suspicious model with strict theoretical guarantees remains a challenging task. To address this gap, we introduce CREDIT, a certified ownership verification against MEAs. Specifically, we employ mutual information to quantify the similarity between DNN models, propose a practical verification threshold, and provide rigorous theoretical guarantees for ownership verification based on this threshold. We extensively evaluate our approach on several mainstream datasets across different domains and tasks, achieving state‑of‑the‑art performance. Our implementation is publicly available at: https://github.com/LabRAI/CREDIT.

Authors:Bolin Shen, Md Shamim Seraj, Zhan Cheng, Shayok Chakraborty, Yushun Dong
Title: CITED: A Decision Boundary-Aware Signature for GNNs Towards Model Extraction Defense
Abstract:
Graph neural networks (GNNs) have demonstrated superior performance in various applications, such as recommendation systems and financial risk management. However, deploying large‑scale GNN models locally is particularly challenging for users, as it requires significant computational resources and extensive property data. Consequently, Machine Learning as a Service (MLaaS) has become increasingly popular, offering a convenient way to deploy and access various models, including GNNs. However, an emerging threat known as Model Extraction Attacks (MEAs) presents significant risks, as adversaries can readily obtain surrogate GNN models exhibiting similar functionality. Specifically, attackers repeatedly query the target model using subgraph inputs to collect corresponding responses. These input‑output pairs are subsequently utilized to train their own surrogate models at minimal cost. Many techniques have been proposed to defend against MEAs, but most are limited to specific output levels (e.g., embedding or label) and suffer from inherent technical drawbacks. To address these limitations, we propose a novel ownership verification framework CITED which is a first‑of‑its‑kind method to achieve ownership verification on both embedding and label levels. Moreover, CITED is a novel signature‑based method that neither harms downstream performance nor introduces auxiliary models that reduce efficiency, while still outperforming all watermarking and fingerprinting approaches. Extensive experiments demonstrate the effectiveness and robustness of our CITED framework. Code is available at: https://github.com/LabRAI/CITED.

Authors:Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee
Title: CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
Abstract:
Recent vision‑language models (VLMs) such as CLIP demonstrate impressive cross‑modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real‑world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few‑shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge‑driven prompt tuning scheme that integrates high‑level language priors with geometric cues from a lightweight 3D encoder. To adapt task‑specific features effectively, we apply parameter‑efficient fine‑tuning to CLIP's encoders and design an entropy‑guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport‑based alignment loss and an uncertainty‑aware prototype alignment loss collaboratively bridge source‑target distribution gaps while maintaining class separability. Extensive experiments on PointDA‑10 and GraspNetPC‑10 benchmarks show that CLIPoint3D achieves consistent 3‑16% accuracy gains over both CLIP‑based and conventional encoder‑based baselines. Codes are available at https://github.com/SarthakM320/CLIPoint3D.

Authors:Haixu Wu, Minghao Guo, Zongyi Li, Zhiyang Dou, Mingsheng Long, Kaiming He, Wojciech Matusik
Title: GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training
Abstract:
Neural simulators promise efficient surrogates for physics simulation, but scaling them is bottlenecked by the prohibitive cost of generating high‑fidelity training data. Pre‑training on abundant off‑the‑shelf geometries offers a natural alternative, yet faces a fundamental gap: supervision on static geometry alone ignores dynamics and can lead to negative transfer on physics tasks. We present GeoPT, a unified pre‑trained model for general physics simulation based on lifted geometric pre‑training. The core idea is to augment geometry with synthetic dynamics, enabling dynamics‑aware self‑supervision without physics labels. Pre‑trained on over one million samples, GeoPT consistently improves industrial‑fidelity benchmarks spanning fluid mechanics for cars, aircraft, and ships, and solid mechanics in crash simulation, reducing labeled data requirements by 20‑60% and accelerating convergence by 2×. These results show that lifting with synthetic dynamics bridges the geometry‑physics gap, unlocking a scalable path for neural simulation and potentially beyond. Code is available at https://github.com/Physics‑Scaling/GeoPT.

Authors:Zhuofan Josh Ying, Shauli Ravfogel, Nikolaus Kriegeskorte, Peter Hase
Title: The Truthfulness Spectrum Hypothesis
Abstract:
Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain‑general to narrowly domain‑specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation‑inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation‑inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain‑general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near‑perfectly predicts cross‑domain generalization (R^2=0.98). Concept‑erasure methods further isolate truth directions that are (1) domain‑general, (2) domain‑specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain‑specific directions steer more effectively than domain‑general ones. Finally, post‑training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post‑training reshaping their geometry. Code for all experiments is provided in https://github.com/zfying/truth_spec.

Authors:Wall Kim, Chaeyoung Song, Hanul Kim
Title: MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Abstract:
Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi‑Modal Prior‑data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non‑tabular modalities in a unified manner. MMPFN comprises per‑modality encoders, modality projectors, and pre‑trained foundation models. The modality projectors serve as the critical bridge, transforming non‑tabular embeddings into tabular‑compatible tokens for unified processing. To this end, we introduce a multi‑head gated MLP and a cross‑attention pooler that extract richer context from non‑tabular inputs while mitigates attention imbalance issue in multimodal learning. Extensive experiments on medical and general‑purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state‑of‑the‑art methods and effectively exploits non‑tabular modalities alongside tabular features. These results highlight the promise of extending prior‑data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning. The source code is available at https://github.com/too‑z/MultiModalPFN.

Authors:Soumik Deb Niloy, Md. Fahmid-Ul-Alam Juboraj, Swakkhar Shatabda
Title: KEMP-PIP: A Feature-Fusion Based Approach for Pro-inflammatory Peptide Prediction
Abstract:
Pro‑inflammatory peptides (PIPs) play critical roles in immune signaling and inflammation but are difficult to identify experimentally due to costly and time‑consuming assays. To address this challenge, we present KEMP‑PIP, a hybrid machine learning framework that integrates deep protein embeddings with handcrafted descriptors for robust PIP prediction. Our approach combines contextual embeddings from pretrained ESM protein language models with multi‑scale k‑mer frequencies, physicochemical descriptors, and modlAMP sequence features. Feature pruning and class‑weighted logistic regression manage high dimensionality and class imbalance, while ensemble averaging with an optimized decision threshold enhances the sensitivity‑‑specificity balance. Through systematic ablation studies, we demonstrate that integrating complementary feature sets consistently improves predictive performance. On the standard benchmark dataset, KEMP‑PIP achieves an MCC of 0.505, accuracy of 0.752, and AUC of 0.762, outperforming ProIn‑fuse, MultiFeatVotPIP, and StackPIP. Relative to StackPIP, these results represent improvements of 9.5% in MCC and 4.8% in both accuracy and AUC. The KEMP‑PIP web server is freely available at https://nilsparrow1920‑kemp‑pip.hf.space/ and the full implementation at https://github.com/S18‑Niloy/KEMP‑PIP.

Authors:Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han
Title: Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
Abstract:
Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi‑modal large language models (MLLMs). However, during RL training, the enormous state space of MLLM and sparse rewards often leads to entropy collapse, policy degradation, or over‑exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the drawbacks of uncontrolled random sampling, yielding inefficient exploration. In this paper, we propose CalibRL, a hybrid‑policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution‑aware advantage weighting scales updates by group rareness to calibrate the distribution, therefore preserving exploration. Meanwhile, the asymmetric activation function (LeakyReLU) leverages the expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on‑policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model's policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in‑domain and out‑of‑domain settings, demonstrate consistent improvements, validating the effectiveness of our controllable hybrid‑policy RLVR training. Code is available at https://github.com/zhh6425/CalibRL.

Authors:Lingwei Gu, Nour Jedidi, Jimmy Lin
Title: NanoKnow: How to Know What Your Language Model Knows
Abstract:
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre‑training data is often a "black box" ‑‑ unknown or inaccessible. The recent release of nanochat ‑‑ a family of small LLMs with fully open pre‑training data ‑‑ addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre‑training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed‑book accuracy is strongly influenced by answer frequency in the pre‑training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre‑training, demonstrating that parametric and external knowledge are complementary, and (4) non‑relevant information is harmful, with accuracy decreasing based on both the position and the number of non‑relevant contexts. We release all NanoKnow artifacts at https://github.com/castorini/NanoKnow.

Authors:Harry Anthony, Ziyun Liang, Hermione Warr, Konstantinos Kamnitsas
Title: The Invisible Gorilla Effect in Out-of-distribution Detection
Abstract:
Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out‑of‑distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard‑to‑detect artefacts (near‑OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model's ROI and drops when it does not ‑ a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour‑swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations are available at: https://github.com/HarryAnthony/Invisible_Gorilla_Effect.

Authors:Ha-Anh Hoang Nguyen, Tri-Duc Phan Le, Duc-Hoang Pham, Huy-Son Nguyen, Cam-Van Thi Nguyen, Duc-Trong Le, Hoang-Quynh Le
Title: Counterfactual Understanding via Retrieval-aware Multimodal Modeling for Time-to-Event Survival Prediction
Abstract:
This paper tackles the problem of time‑to‑event counterfactual survival prediction, aiming to optimize individualized survival outcomes in the presence of heterogeneity and censored data. We propose CURE, a framework that advances counterfactual survival modeling via comprehensive multimodal embedding and latent subgroup retrieval. CURE integrates clinical, paraclinical, demographic, and multi‑omics information, which are aligned and fused through cross‑attention mechanisms. Complex multi‑omics signals can be adaptively refined using a mixture‑of‑experts architecture, emphasizing the most informative omics components. Building upon this representation, CURE implicitly retrieves patient‑specific latent subgroups that capture both baseline survival dynamics and treatment‑dependent variations. Experimental results on METABRIC and TCGA‑LUAD datasets demonstrate that proposed CURE model consistently outperforms strong baselines in survival analysis, evaluated using the Time‑dependent Concordance Index (C^td) and Integrated Brier Score (IBS). These findings highlight the potential of CURE to enhance multimodal understanding and serve as a foundation for future treatment recommendation models. All code and related resources are publicly available to facilitate the reproducibility https://github.com/L2R‑UET/CURE.

Authors:Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
Title: DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
Abstract:
Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path‑level diversity, leading to weak and unstable learning signals in group‑based policy optimization. We propose DSDR, a Dual‑Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length‑invariant, token‑level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global‑to‑local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group‑based optimization, and yields a principled global‑to‑local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual‑scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.

Authors:Johanna S. Fröhlich, Bastian Heinlein, Jan U. Claar, Hans Rosenberger, Vasileios Belagiannis, Ralf R. Müller
Title: The Confusion is Real: GRAPHIC - A Network Science Approach to Confusion Matrices in Deep Learning
Abstract:
Explainable artificial intelligence has emerged as a promising field of research to address reliability concerns in artificial intelligence. Despite significant progress in explainable artificial intelligence, few methods provide a systematic way to visualize and understand how classes are confused and how their relationships evolve as training progresses. In this work, we present GRAPHIC, an architecture‑agnostic approach that analyzes neural networks on a class level. It leverages confusion matrices derived from intermediate layers using linear classifiers. We interpret these as adjacency matrices of directed graphs, allowing tools from network science to visualize and quantify learning dynamics across training epochs and intermediate layers. GRAPHIC provides insights into linear class separability, dataset issues, and architectural behavior, revealing, for example, similarities between flatfish and man and labeling ambiguities validated in a human study. In summary, by uncovering real confusions, GRAPHIC offers new perspectives on how neural networks learn. The code is available at https://github.com/Johanna‑S‑Froehlich/GRAPHIC.

Authors:Xinyu Yuan, Xixian Liu, Ya Shi Zhang, Zuobai Zhang, Hongyu Guo, Jian Tang
Title: PerturbDiff: Functional Diffusion for Single-Cell Perturbation Modeling
Abstract:
Building Virtual Cells that can accurately simulate cellular responses to perturbations is a long‑standing goal in systems biology. A fundamental challenge is that high‑throughput single‑cell sequencing is destructive: the same cell cannot be observed both before and after a perturbation. Thus, perturbation prediction requires mapping unpaired control and perturbed populations. Existing models address this by learning maps between distributions, but typically assume a single fixed response distribution when conditioned on observed cellular context (e.g., cell type) and the perturbation type. In reality, responses vary systematically due to unobservable latent factors such as microenvironmental fluctuations and complex batch effects, forming a manifold of possible distributions for the same observed conditions. To account for this variability, we introduce PerturbDiff, which shifts modeling from individual cells to entire distributions. By embedding distributions as points in a Hilbert space, we define a diffusion‑based generative process operating directly over probability distributions. This allows PerturbDiff to capture population‑level response shifts across hidden factors. Benchmarks on established datasets show that PerturbDiff achieves state‑of‑the‑art performance in single‑cell response prediction and generalizes substantially better to unseen perturbations. See our project page (https://katarinayuan.github.io/PerturbDiff‑ProjectPage/), where code and data will be made publicly available (https://github.com/DeepGraphLearning/PerturbDiff).

Authors:Jiayu Wang, Yifei Ming, Zixuan Ke, Shafiq Joty, Aws Albarghouthi, Frederic Sala
Title: SkillOrchestra: Learning to Route Agents via Skill Transfer
Abstract:
Compound AI systems promise capabilities beyond those of individual models, yet their success depends critically on effective orchestration. Existing routing approaches face two limitations: (1) input‑level routers make coarse query‑level decisions that ignore evolving task requirements; (2) RL‑trained orchestrators are expensive to adapt and often suffer from routing collapse, repeatedly invoking one strong but costly option in multi‑turn scenarios. We introduce SkillOrchestra, a framework for skill‑aware orchestration. Instead of directly learning a routing policy end‑to‑end, SkillOrchestra learns fine‑grained skills from execution experience and models agent‑specific competence and cost under those skills. At deployment, the orchestrator infers the skill demands of the current interaction and selects agents that best satisfy them under an explicit performance‑cost trade‑off. Extensive experiments across ten benchmarks demonstrate that SkillOrchestra outperforms SoTA RL‑based orchestrators by up to 22.5% with 700x and 300x learning cost reduction compared to Router‑R1 and ToolOrchestra, respectively. These results show that explicit skill modeling enables scalable, interpretable, and sample‑efficient orchestration, offering a principled alternative to data‑intensive RL‑based approaches. The code is available at: https://github.com/jiayuww/SkillOrchestra.

Authors:Luhan Tang, Longxuan Yu, Shaorong Zhang, Greg Ver Steeg
Title: Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models
Abstract:
Discrete diffusion language models (dLLMs) provide a fast and flexible alternative to autoregressive models (ARMs) via iterative denoising with parallel updates. However, their evaluation is challenging: existing metrics conflate denoiser approximation error with sampler‑induced error from the sampling dynamics, a problem that does not arise for ARMs whose autoregressive sampling exactly reflects the learned probability model. We introduce a sampler‑centric oracle framework that replaces learned denoisers with an exact Hidden Markov Model posterior derived from a ground‑truth Markov chain, isolating sampler‑induced error in a controlled setting. We show that few‑step discrete diffusion samplers are not distributionally correct even under an oracle denoiser, with transition‑level mismatch that vanishes only as the number of steps approaches the sequence length. Moreover, improvements in negative log‑likelihood, generative perplexity, or MAUVE do not imply correct sampling. Code is available at https://luhantang.github.io/dllm_sampler

Authors:Jeremy McEntire
Title: Leap+Verify: Regime-Adaptive Speculative Weight Prediction for Accelerating Neural Network Training
Abstract:
We introduce Leap+Verify, a framework that applies speculative execution ‑‑ predicting future model weights and validating predictions before acceptance ‑‑ to accelerate neural network training. Inspired by speculative decoding in language model inference and by the Automatically Scalable Computation (ASC) architecture for program execution, Leap+Verify decomposes training into three dynamically detected regimes (chaotic, transition, stable) using activation‑space cosine similarity as a real‑time Lyapunov proxy signal. Within each regime, analytic weight predictors (momentum, linear, quadratic extrapolation) attempt to forecast model parameters K training steps ahead; predictions are accepted only when validated against a held‑out loss criterion. We evaluate Leap+Verify on GPT‑2 124M and Qwen 2.5‑1.5B trained on WikiText‑103 across five random seeds, sweeping prediction depth K in 5, 10, 25, 50, 75, 100. Momentum‑based prediction (Adam moment extrapolation) fails catastrophically at both scales, with predicted losses exceeding actuals by 100‑10,000x ‑‑ a universal norm explosion in optimizer‑state extrapolation. Finite‑difference predictors (linear, quadratic) succeed where momentum fails: at 124M, they achieve 24% strict acceptance at K=5 in stable regimes; at 1.5B, they achieve 37% strict acceptance in transition regimes. The scale‑dependent finding is in regime distribution: GPT‑2 124M spends 34% of training in stable regime, while Qwen 1.5B spends 64% in chaotic regime and reaches stable in only 0‑2 of 40 checkpoints. Larger models are more predictable when predictable, but less often predictable ‑‑ the practical bottleneck shifts from predictor accuracy to regime availability. Cross‑seed results are highly consistent (less than 1% validation loss variance), and the three‑regime framework produces identical phase boundaries (plus or minus 50 steps) across seeds.

Authors:Pengxi Liu, Zeyu Michael Li, Xiang Cheng
Title: Variational Trajectory Optimization of Anisotropic Diffusion Schedules
Abstract:
We introduce a variational framework for diffusion models with anisotropic noise schedules parameterized by a matrix‑valued path M_t(θ) that allocates noise across subspaces. Central to our framework is a trajectory‑level objective that jointly trains the score network and learns M_t(θ), which encompasses general parameterization classes of matrix‑valued noise schedules. We further derive an estimator for the derivative with respect to θ of the score that enables efficient optimization of the M_t(θ) schedule. For inference, we develop an efficiently‑implementable reverse‑ODE solver that is an anisotropic generalization of the second‑order Heun discretization algorithm. Across CIFAR‑10, AFHQv2, FFHQ, and ImageNet‑64, our method consistently improves upon the baseline EDM model in all NFE regimes. Code is available at https://github.com/lizeyu090312/anisotropic‑diffusion‑paper.

Authors:Arjun Chatterjee, Sayeed Sajjad Razin, John Wu, Siddhartha Laghuvarapu, Jathurshan Pradeepkumar, Jimeng Sun
Title: Making Conformal Predictors Robust in Healthcare Settings: a Case Study on EEG Classification
Abstract:
Quantifying uncertainty in clinical predictions is critical for high‑stakes diagnosis tasks. Conformal prediction offers a principled approach by providing prediction sets with theoretical coverage guarantees. However, in practice, patient distribution shifts violate the i.i.d. assumptions underlying standard conformal methods, leading to poor coverage in healthcare settings. In this work, we evaluate several conformal prediction approaches on EEG seizure classification, a task with known distribution shift challenges and label uncertainty. We demonstrate that personalized calibration strategies can improve coverage by over 20 percentage points while maintaining comparable prediction set sizes. Our implementation is available via PyHealth, an open‑source healthcare AI framework: https://github.com/sunlabuiuc/PyHealth.

Authors:Pao-Hsiung Chiu, Jian Cheng Wong, Chin Chun Ooi, Chang Wei, Yuchen Fan, Yew-Soon Ong
Title: Scale-PINN: Learning Efficient Physics-Informed Neural Networks Through Sequential Correction
Abstract:
Physics‑informed neural networks (PINNs) have emerged as a promising mesh‑free paradigm for solving partial differential equations, yet adoption in science and engineering is limited by slow training and modest accuracy relative to modern numerical solvers. We introduce the Sequential Correction Algorithm for Learning Efficient PINN (Scale‑PINN), a learning strategy that bridges modern physics‑informed learning with numerical algorithms. Scale‑PINN incorporates the iterative residual‑correction principle, a cornerstone of numerical solvers, directly into the loss formulation, marking a paradigm shift in how PINN losses can be conceived and constructed. This integration enables Scale‑PINN to achieve unprecedented convergence speed across PDE problems from different physics domain, including reducing training time on a challenging fluid‑dynamics problem for state‑of‑the‑art PINN from hours to sub‑2 minutes while maintaining superior accuracy, and enabling application to representative problems in aerodynamics and urban science. By uniting the rigor of numerical methods with the flexibility of deep learning, Scale‑PINN marks a significant leap toward the practical adoption of PINNs in science and engineering through scalable, physics‑informed learning. Codes are available at https://github.com/chiuph/SCALE‑PINN.

Authors:Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, Xunliang Cai
Title: How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for Large Language Model (LLM) reasoning, yet current methods face key challenges in resource allocation and policy optimization dynamics: (i) uniform rollout allocation ignores gradient variance heterogeneity across problems, and (ii) the softmax policy structure causes gradient attenuation for high‑confidence correct actions, while excessive gradient updates may destabilize training. Therefore, we propose DynaMO, a theoretically‑grounded dual‑pronged optimization framework. At the sequence level, we prove that uniform allocation is suboptimal and derive variance‑minimizing allocation from the first principle, establishing Bernoulli variance as a computable proxy for gradient informativeness. At the token level, we develop gradient‑aware advantage modulation grounded in theoretical analysis of gradient magnitude bounds. Our framework compensates for gradient attenuation of high‑confidence correct actions while utilizing entropy changes as computable indicators to stabilize excessive update magnitudes. Extensive experiments conducted on a diverse range of mathematical reasoning benchmarks demonstrate consistent improvements over strong RLVR baselines. Our implementation is available at: \hrefhttps://anonymous.4open.science/r/dynamo‑680E/README.mdhttps://anonymous.4open.science/r/dynamo.

Authors:Saba Kublashvili
Title: Virtual Parameter Sharpening: Dynamic Low-Rank Perturbations for Inference-Time Reasoning Enhancement
Abstract:
I introduce Virtual Parameter Sharpening (VPS), an inference‑time technique that augments frozen transformer linear layers with dynamic, activation‑conditioned low‑rank perturbations. Unlike parameter‑efficient fine‑tuning methods such as LoRA, which learn static low‑rank adapters, VPS constructs its perturbation factors on the fly from batch activation statistics and optional gradient signals, enabling test‑time adaptation without persistent parameter updates. The perturbation takes the form Delta W = gamma W^T V U^T W, where selector matrices U and V are constructed via sparse activation‑guided selection or Sylvester‑coupled regression. We provide a theoretical analysis of the perturbation's spectral properties and describe an adaptive policy system that modulates perturbation magnitude based on activation energy and token‑level entropy. This system incorporates multi‑objective verification with iterative refinement for tasks with ground‑truth supervision. We present the complete algorithmic framework, analyze its mathematical foundations, and discuss the mechanisms by which activation‑conditioned computation may enhance reasoning capabilities in large language models. Implementation and experimental code are available at https://github.com/Saba‑Kublashvili/vps‑virtual‑parameter‑synthesis .

Authors:Qi Sun, Can Wang, Jiaxiang Shang, Yingchun Liu, Jing Liao
Title: Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
Abstract:
Current 3D human animation methods struggle to achieve photorealism: kinematics‑based approaches lack non‑rigid dynamics (e.g., clothing dynamics), while methods that leverage video diffusion priors can synthesize non‑rigid motion but suffer from quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics‑based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion from residual non‑rigid motion. Rigid motion is generated by a kinematic method, which then produces a coarse rendering to guide the video diffusion model in generating video sequences that restore the residual non‑rigid motion. However, this restoration task, based on diffusion sampling, is highly challenging, as the initial renderings are out‑of‑distribution, causing standard deterministic ODE samplers to fail. Therefore, we propose a novel self‑guided stochastic sampling method, which effectively addresses the out‑of‑distribution problem by combining stochastic sampling (for photorealistic quality) with self‑guidance (for identity fidelity). These restored videos provide high‑quality supervision, enabling the optimization of the residual non‑rigid motion field. Extensive experiments demonstrate that \MethodName can generate photorealistic 3D human animation, outperforming existing methods. Code is available in https://github.com/qiisun/ani3dhuman.

Authors:Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su, Xiaoqing Wang, Qi Guo, Jundong Li
Title: IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
Abstract:
Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference‑time costs. We revisit token‑efficient post‑training and argue that existing sequence‑level reward‑shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information‑theoretic post‑training framework that assigns token‑wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low‑utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token‑efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information‑aware advantage shaping is a powerful and general direction for token‑efficient post‑training. The code is available at https://github.com/YinhanHe123/IAPO.

Authors:Philip Mortimer, Cristiana Diaconu, Tommy Rochussen, Bruno Mlodozeniec, Richard E. Turner
Title: Incremental Transformer Neural Processes
Abstract:
Neural Processes (NPs), and specifically Transformer Neural Processes (TNPs), have demonstrated remarkable performance across tasks ranging from spatiotemporal forecasting to tabular data modelling. However, many of these applications are inherently sequential, involving continuous data streams such as real‑time sensor readings or database updates. In such settings, models should support cheap, incremental updates rather than recomputing internal representations from scratch for every new observation ‑‑ a capability existing TNP variants lack. Drawing inspiration from Large Language Models, we introduce the Incremental TNP (incTNP). By leveraging causal masking, Key‑Value (KV) caching, and a data‑efficient autoregressive training strategy, incTNP matches the predictive performance of standard TNPs while reducing the computational cost of updates from quadratic to linear time complexity. We empirically evaluate our model on a range of synthetic and real‑world tasks, including tabular regression and temperature prediction. Our results show that, surprisingly, incTNP delivers performance comparable to ‑‑ or better than ‑‑ non‑causal TNPs while unlocking orders‑of‑magnitude speedups for sequential inference. Finally, we assess the consistency of the model's updates ‑‑ by adapting a metric of ``implicit Bayesianness", we show that incTNP retains a prediction rule as implicitly Bayesian as standard non‑causal TNPs, demonstrating that incTNP achieves the computational benefits of causal masking without sacrificing the consistency required for streaming inference.

Authors:Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen
Title: [b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic
Abstract:
Self‑supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlate to the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic‑arithmetic .

Authors:Ziheng Chen, Bernhard Schölkopf, Nicu Sebe
Title: Hyperbolic Busemann Neural Networks
Abstract:
Hyperbolic spaces provide a natural geometry for representing hierarchical and tree‑structured data due to their exponential volume growth. To leverage these benefits, neural networks require intrinsic and efficient components that operate directly in hyperbolic space. In this work, we lift two core components of neural networks, Multinomial Logistic Regression (MLR) and Fully Connected (FC) layers, into hyperbolic space via Busemann functions, resulting in Busemann MLR (BMLR) and Busemann FC (BFC) layers with a unified mathematical interpretation. BMLR provides compact parameters, a point‑to‑horosphere distance interpretation, batch‑efficient computation, and a Euclidean limit, while BFC generalizes FC and activation layers with comparable complexity. Experiments on image classification, genome sequence learning, node classification, and link prediction demonstrate improvements in effectiveness and efficiency over prior hyperbolic layers. The code is available at https://github.com/GitZH‑Chen/HBNN.

Authors:Jiayi Li, Zhaonan Wang, Flora D. Salim
Title: SGNO: Spectral Generator Neural Operators for Stable Long Horizon PDE Rollouts
Abstract:
Neural operators provide fast PDE surrogates and often generalize across parameters and resolutions. However, in the short train long test setting, autoregressive rollouts can become unstable. This typically happens for two reasons: one step errors accumulate over time, and high frequency components feed back and grow. We introduce the Spectral Generator Neural Operator (SGNO), a residual time stepper that targets both effects. For the linear part, SGNO uses an exponential time differencing update in Fourier space with a learned diagonal generator. We constrain the real part of this generator to be nonpositive, so iterating the step does not amplify the linear dynamics. For nonlinear dynamics, SGNO adds a gated forcing term with channel mixing within each Fourier mode, which keeps the nonlinear update controlled. To further limit high frequency feedback, SGNO applies spectral truncation and an optional smooth mask on the forcing pathway. We derive a one step amplification bound and a finite horizon rollout error bound. The bound separates generator approximation error from nonlinear mismatch and gives sufficient conditions under which the latent L^2 norm does not grow across rollout steps. On APEBench spanning 1D, 2D, and 3D PDE families, SGNO achieves lower long horizon error and longer stable rollout lengths than strong neural operator baselines. Ablations confirm the roles of the generator constraint, gating, and filtering.The code is available at https://github.com/lijy32123‑cloud/SGNO.

Authors:Osman Onur Kuzucu, Tunca Doğan
Title: GLaDiGAtor: Language-Model-Augmented Multi-Relation Graph Learning for Predicting Disease-Gene Associations
Abstract:
Understanding disease‑gene associations is essential for unravelling disease mechanisms and advancing diagnostics and therapeutics. Traditional approaches based on manual curation and literature review are labour‑intensive and not scalable, prompting the use of machine learning on large biomedical data. In particular, graph neural networks (GNNs) have shown promise for modelling complex biological relationships. To address limitations in existing models, we propose GLaDiGAtor (Graph Learning‑bAsed DIsease‑Gene AssociaTiOn pRediction), a novel GNN framework with an encoder‑decoder architecture for disease‑gene association prediction. GLaDiGAtor constructs a heterogeneous biological graph integrating gene‑gene, disease‑disease, and gene‑disease interactions from curated databases, and enriches each node with contextual features from well‑known language models (ProtT5 for protein sequences and BioBERT for disease text). In evaluations, our model achieves superior predictive accuracy and generalisation, outperforming 14 existing methods. Literature‑supported case studies confirm the biological relevance of high‑confidence novel predictions, highlighting GLaDiGAtor's potential to discover candidate disease genes. These results underscore the power of graph convolutional networks in biomedical informatics and may ultimately facilitate drug discovery by revealing new gene‑disease links. The source code and processed datasets are publicly available at https://github.com/HUBioDataLab/GLaDiGAtor.

Authors:Nikolaos Kougioulis, Nikolaos Gkorgkolis, MingXue Wang, Bora Caglayan, Dario Simionato, Andrea Tonon, Ioannis Tsamardinos
Title: Large Causal Models for Temporal Causal Discovery
Abstract:
Causal discovery for both cross‑sectional and temporal data has traditionally followed a dataset‑specific paradigm, where a new model is fitted for each individual dataset. Such an approach limits the potential of multi‑dataset pretraining. The concept of large causal models (LCMs) envisions a class of pre‑trained neural architectures specifically designed for temporal causal discovery. Prior approaches are constrained to small variable counts, degrade with larger inputs, and rely heavily on synthetic data, limiting generalization. We propose a principled framework for LCMs, combining diverse synthetic generators with realistic time‑series datasets, allowing learning at scale. Extensive experiments on synthetic, semi‑synthetic and realistic benchmarks show that LCMs scale effectively to higher variable counts and deeper architectures while maintaining strong performance. Trained models achieve competitive or superior accuracy compared to classical and neural baselines, particularly in out‑of‑distribution settings, while enabling fast, single‑pass inference. Results demonstrate LCMs as a promising foundation‑model paradigm for temporal causal discovery. Experiments and model weights are available at https://github.com/kougioulis/LCM‑paper/.

Authors:Yanlin Zhang, Linjie Xu, Quan Gan, David Wipf, Minjie Wang
Title: RDBLearn: Simple In-Context Prediction Over Relational Databases
Abstract:
Recent advances in tabular in‑context learning (ICL) show that a single pretrained model can adapt to new prediction tasks from a small set of labeled examples, avoiding per‑task training and heavy tuning. However, many real‑world tasks live in relational databases, where predictive signal is spread across multiple linked tables rather than a single flat table. We show that tabular ICL can be extended to relational prediction with a simple recipe: automatically featurize each target row using relational aggregations over its linked records, materialize the resulting augmented table, and run an off‑the‑shelf tabular foundation model on it. We package this approach in RDBLearn (https://github.com/HKUSHXLab/rdblearn), an easy‑to‑use toolkit with a scikit‑learn‑style estimator interface that makes it straightforward to swap different tabular ICL backends; a complementary agent‑specific interface is provided as well. Across a broad collection of RelBench and 4DBInfer datasets, RDBLearn is the best‑performing foundation model approach we evaluate, at times even outperforming strong supervised baselines trained or fine‑tuned on each dataset.

Authors:Nahom Birhan, Daniel Wesego, Dereje Shenkut, Frank Liu, Daniel Takabi
Title: DCInject: Persistent Backdoor Attacks via Frequency Manipulation in Personal Federated Learning
Abstract:
Personalized federated learning (PFL) creates client‑specific models to handle data heterogeneity. Previously, PFL has been shown to be naturally resistant to backdoor attack propagation across clients. In this work, we reveal that PFL remains vulnerable to backdoor attacks through a novel frequency‑domain approach. We propose DCInject, an adaptive frequency‑domain backdoor attack for PFL, which removes portions of the zero‑frequency (DC) component and replaces them with Gaussian‑distributed samples in the frequency domain. Our attack achieves superior attack success rates while maintaining clean accuracy across four datasets (CIFAR‑10/100, GTSRB, SVHN) compared to existing spatial‑domain attacks, evaluated under parameter decoupling based personalization. DCInject achieves superior performance with ASRs of 96.83% (CIFAR‑10), 99.38% (SVHN), and 100% (GTSRB) while maintaining clean accuracy. Under I‑BAU defense, DCInject demonstrates strong persistence, retaining 90.30% ASR vs BadNet's 58.56% on VGG‑16, exposing critical vulnerabilities in PFL security assumptions. Our code is available at https://github.com/NahomMA/DCINject‑PFL

Authors:Guoqi Yu, Juncheng Wang, Chen Yang, Jing Qin, Angelica I. Aviles-Rivero, Shujun Wang
Title: Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series
Abstract:
Accurate analysis of medical time series (MedTS) data, such as electroencephalography (EEG) and electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibit two critical patterns: temporal dependencies within individual channels and channel dependencies across multiple channels. While recent advances in deep learning have leveraged Transformer‑based models to effectively capture temporal dependencies, they often struggle with modeling channel dependencies. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention mechanism is decentralized, making it less effective at capturing global synchronization and unified waveform patterns. To address this mismatch, we propose CoTAR (Core Token Aggregation‑Redistribution), a centralized MLP‑based module designed to replace decentralized attention. Instead of allowing all tokens to interact directly, as in standard attention, CoTAR introduces a global core token that serves as a proxy to facilitate inter‑token interactions, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a 12.13% improvement on the APAVA dataset, while using only 33% of the memory and 20% of the inference time compared to the previous state of the art. Code and all training scripts are available at https://github.com/Levi‑Ackman/TeCh.

Authors:Sanjeev Panta, Xu Yuan, Li Chen, Nian-Feng Tzeng
Title: Revisiting the Seasonal Trend Decomposition for Enhanced Time Series Forecasting
Abstract:
Time series forecasting presents significant challenges in real‑world applications across various domains. Building upon the decomposition of the time series, we enhance the architecture of machine learning models for better multivariate time series forecasting. To achieve this, we focus on the trend and seasonal components individually and investigate solutions to predict them with less errors. Recognizing that reversible instance normalization is effective only for the trend component, we take a different approach with the seasonal component by directly applying backbone models without any normalization or scaling procedures. Through these strategies, we successfully reduce error values of the existing state‑of‑the‑art models and finally introduce dual‑MLP models as more computationally efficient solutions. Furthermore, our approach consistently yields positive results with around 10% MSE average reduction across four state‑of‑the‑art baselines on the benchmark datasets. We also evaluate our approach on a hydrological dataset extracted from the United States Geological Survey (USGS) river stations, where our models achieve significant improvements while maintaining linear time complexity, demonstrating real‑world effectiveness. The source code is available at https://github.com/Sanjeev97/Time‑Series‑Decomposition

Authors:Xiaoyan Bai, Alexander Baumgartner, Haojia Sun, Ari Holtzman, Chenhao Tan
Title: The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research
Abstract:
Reproducibility crises across sciences highlight the limitations of the paper‑centric review system in assessing the rigor and reproducibility of research. AI agents that autonomously design and generate large volumes of research outputs exacerbate these challenges. In this work, we address the growing challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We propose the first execution‑grounded evaluation framework that verifies research beyond narrative review by examining code and data alongside the paper. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent, an automated evaluation framework that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of findings. We show that our framework achieves above 80% agreement with human judges, identifies substantial methodological problems, and surfaces 51 additional issues that human reviewers miss. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.

Authors:Geri Skenderi, Lorenzo Buffoni, Francesco D'Amico, David Machado, Raffaele Marino, Matteo Negri, Federico Ricci-Tersenghi, Carlo Lucibello, Maria Chiara Angelini
Title: Benchmarking Graph Neural Networks in Solving Hard Constraint Satisfaction Problems
Abstract:
Graph neural networks (GNNs) are increasingly applied to hard optimization problems, often claiming superiority over classical heuristics. However, such claims risk being unsolid due to a lack of standard benchmarks on truly hard instances. From a statistical physics perspective, we propose new hard benchmarks based on random problems. We provide these benchmarks, along with performance results from both classical heuristics and GNNs. Our fair comparison shows that classical algorithms still outperform GNNs. We discuss the challenges for neural networks in this domain. Future claims of superiority can be made more robust using our benchmarks, available at https://github.com/ArtLabBocconi/RandCSPBench.

Authors:Minh Dinh, Stéphane Deny
Title: Latent Equivariant Operators for Robust Object Recognition: Promises and Challenges
Abstract:
Despite the successes of deep learning in computer vision, difficulties persist in recognizing objects that have undergone group‑symmetric transformations rarely seen during training\unicodex2013for example objects seen in unusual poses, scales, positions, or combinations thereof. Equivariant neural networks are a solution to the problem of generalizing across symmetric transformations, but require knowledge of transformations a priori. An alternative family of architectures proposes to learn equivariant operators in a latent space, from examples of symmetric transformations. Here, using simple datasets of rotated and translated noisy MNIST, we illustrate how such architectures can successfully be harnessed for out‑of‑distribution classification, thus overcoming the limitations of both traditional and equivariant networks. While conceptually enticing, we discuss challenges ahead on the path of scaling these architectures to more complex datasets. Our code is available at https://github.com/BRAIN‑Aalto/equivariant_operator.

Authors:Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig
Title: VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Abstract:
Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM‑based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition‑rich codebases with substantial project‑specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open‑source formal‑methods developments and packaged to preserve realistic repository context and cross‑file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib‑style mathematics transfer poorly to this repository‑centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi‑hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia‑group/VeriSoftBench.

Authors:Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He
Title: PRISM: Parallel Reward Integration with Symmetry for MORL
Abstract:
This work studies heterogeneous Multi‑Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long‑horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory‑motivated model that reconciles temporal‑frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection‑equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse‑reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at \hrefhttps://github.com/EVIEHub/PRISMhttps://github.com/EVIEHub/PRISM.

Authors:Aaron Louis Eidt, Nils Feldhus
Title: Simplifying Outcomes of Language Model Component Analyses with ELIA
Abstract:
While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques ‑‑ Attribution Analysis, Function Vector Analysis, and Circuit Tracing ‑‑ and introduces a novel methodology: using a vision‑language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed‑methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI‑powered explanations helped bridge the knowledge gap for non‑experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user‑centered design that prioritizes interactivity, specificity, and narrative guidance.

Authors:Jorge Carrasco Pollo, Ioannis Kapetangeorgis, Joshua Rosenthal, John Hua Yao
Title: [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games
Abstract:
Large Language Models (LLMs) demonstrate significant potential in multi‑agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and thoroughness of the ablation study. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal insights that provide additional context to potential users. Our results highlight the importance of context in model‑comparative evaluations.

Authors:Yuankai Luo, Woping Chen, Tong Liang, Baiqiao Wang, Zhenguo Li
Title: SimVLA: A Simple VLA Baseline for Robotic Manipulation
Abstract:
Vision‑Language‑Action (VLA) models have emerged as a promising paradigm for general‑purpose robotic manipulation, leveraging large‑scale pre‑training to achieve strong performance. The field has rapidly evolved with additional spatial priors and diverse architectural innovations. However, these advancements are often accompanied by varying training recipes and implementation details, which can make it challenging to disentangle the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control, using a standard vision‑language backbone and a lightweight action head, and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state‑of‑the‑art performance. Despite having only 0.5B parameters, SimVLA outperforms multi‑billion‑parameter models on standard simulation benchmarks without robot pretraining. SimVLA also reaches on‑par real‑robot performance compared to pi0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations. Website: https://frontierrobo.github.io/SimVLA

Authors:Joseph Bingham, Netanel Arussy, Dvir Aran
Title: SOMtime the World Ain$'$t Fair: Violating Fairness Using Self-Organizing Maps
Abstract:
Unsupervised representations are widely assumed to be neutral with respect to sensitive attributes when those attributes are withheld from training. We show that this assumption is false. Using SOMtime, a topology‑preserving representation method based on high‑capacity Self‑Organizing Maps, we demonstrate that sensitive attributes such as age and income emerge as dominant latent axes in purely unsupervised embeddings, even when explicitly excluded from the input. On two large‑scale real‑world datasets (the World Values Survey across five countries and the Census‑Income dataset), SOMtime recovers monotonic orderings aligned with withheld sensitive attributes, achieving Spearman correlations of up to 0.85, whereas PCA and UMAP typically remain below 0.23 (with a single exception reaching 0.31), and against t‑SNE and autoencoders which achieve at most 0.34. Furthermore, unsupervised segmentation of SOMtime embeddings produces demographically skewed clusters, demonstrating downstream fairness risks without any supervised task. These findings establish that fairness through unawareness fails at the representation level for ordinal sensitive attributes and that fairness auditing must extend to unsupervised components of machine learning pipelines. We have made the code available at~ https://github.com/JosephBingham/SOMtime

Authors:Narjes Nourzad, Carlee Joe-Wong
Title: MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance
Abstract:
Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory‑Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision‑relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent's high‑return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real‑time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent's policy gradually surpasses the initial LLM‑derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility‑based shaping improves early‑stage learning in sparse‑reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: https://narjesno.github.io/MIRA/

Authors:Athanasios Angelakis
Title: ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging
Abstract:
Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge‑deployed clinical systems. We introduce ZACH‑ViT (Zero‑token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term "Zero‑token" specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi‑class tasks under a strict few‑shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime‑dependent behavior: ZACH‑ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH‑ViT achieves competitive performance while maintaining sub‑second inference times, supporting deployment in resource‑constrained clinical environments. Code and models are available at https://github.com/Bluesman79/ZACH‑ViT.

Authors:Adrian Catalin Lutu, Eduard Poesina, Radu Tudor Ionescu
Title: VQPP: Video Query Performance Prediction Benchmark
Abstract:
Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content‑based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text‑to‑video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre‑retrieval and post‑retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre‑retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre‑retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at https://github.com/AdrianLutu/VQPP.

Authors:Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies
Title: CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
Abstract:
Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text‑to‑hand‑motion generation or hand animation captioning rely on studio‑captured datasets with limited actions and contexts, making them costly to scale to "in‑the‑wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text‑motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D‑HIW), a dataset of 32K 3D hand‑motion sequences and aligned text, and (2) propose CLUTCH, an LLM‑based hand animation system with two critical innovations: (a) SHIFT, a novel VQ‑VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D‑HIW, we propose a data annotation pipeline that combines vision‑language models (VLMs) and state‑of‑the‑art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in‑the‑wild, CLUTCH employs SHIFT, a part‑modality decomposed VQ‑VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co‑supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state‑of‑the‑art performance on text‑to‑motion and motion‑to‑text tasks, establishing the first benchmark for scalable in‑the‑wild hand motion modelling. Code, data and models will be released.

Authors:Ziyuan Liu, Shizhao Sun, Danqing Huang, Yingdong Shi, Meisheng Zhang, Ji Li, Jingsong Yu, Jiang Bian
Title: DesignAsCode: Bridging Structural Editability and Visual Fidelity in Graphic Design Generation
Abstract:
Graphic design generation demands a delicate balance between high visual fidelity and fine‑grained structural editability. However, existing approaches typically bifurcate into either non‑editable raster image synthesis or abstract layout generation devoid of visual content. Recent combinations of these two approaches attempt to bridge this gap but often suffer from rigid composition schemas and unresolvable visual dissonances (e.g., text‑background conflicts) due to their inexpressive representation and open‑loop nature. To address these challenges, we propose DesignAsCode, a novel framework that reimagines graphic design as a programmatic synthesis task using HTML/CSS. Specifically, we introduce a Plan‑Implement‑Reflect pipeline, incorporating a Semantic Planner to construct dynamic, variable‑depth element hierarchies and a Visual‑Aware Reflection mechanism that iteratively optimizes the code to rectify rendering artifacts. Extensive experiments demonstrate that DesignAsCode significantly outperforms state‑of‑the‑art baselines in both structural validity and aesthetic quality. Furthermore, our code‑native representation unlocks advanced capabilities, including automatic layout retargeting, complex document generation (e.g., resumes), and CSS‑based animation. Our project page is available at https://liuziyuan1109.github.io/design‑as‑code/.

Authors:Irene Iele, Giulia Romoli, Daniele Molino, Elena Mulero Ayllón, Filippo Ruffini, Paolo Soda, Matteo Tortora
Title: Probabilistic NDVI Forecasting from Sparse Satellite Time Series and Weather Covariates
Abstract:
Accurate short‑term forecasting of vegetation dynamics is a key enabler for data‑driven decision support in precision agriculture. Normalized Difference Vegetation Index (NDVI) forecasting from satellite observations, however, remains challenging due to sparse and irregular sampling caused by cloud coverage, as well as the heterogeneous climatic conditions under which crops evolve. In this work, we propose a probabilistic forecasting framework specifically designed for field‑level NDVI prediction under clear‑sky acquisition constraints. The method leverages a transformer‑based architecture that explicitly separates the modeling of historical vegetation dynamics from future exogenous information, integrating historical NDVI observations with both historical and future meteorological covariates. To address irregular revisit patterns and horizon‑dependent uncertainty, we introduce a temporal‑distance weighted quantile loss that aligns the training objective with the effective forecasting horizon. In addition, we incorporate cumulative and extreme‑weather feature engineering to better capture delayed meteorological effects relevant to vegetation response. Extensive experiments on European satellite data demonstrate that the proposed approach consistently outperforms a diverse set of statistical, deep learning, and recent time series baselines across both point‑wise and probabilistic evaluation metrics. Ablation studies further highlight the central role of target history, while showing that meteorological covariates provide complementary gains when jointly exploited. The code is available at https://github.com/arco‑group/ndvi‑forecasting.

Authors:Peng Sun, Xinyi Shang, Tao Lin, Zhiqiang Shen
Title: Duality Models: An Embarrassingly Simple One-step Generation Paradigm
Abstract:
Consistency‑based generative models like Shortcut and MeanFlow achieve impressive results via a target‑aware design for solving the Probability Flow ODE (PF‑ODE). Typically, such methods introduce a target time r alongside the current time t to modulate outputs between a local multi‑step derivative (r = t) and a global few‑step integral (r = 0). However, the conventional "one input, one output" paradigm enforces a partition of the training budget, often allocating a significant portion (e.g., 75% in MeanFlow) solely to the multi‑step objective for stability. This separation forces a trade‑off: allocating sufficient samples to the multi‑step objective leaves the few‑step generation undertrained, which harms convergence and limits scalability. To this end, we propose Duality Models (DuMo) via a "one input, dual output" paradigm. Using a shared backbone with dual heads, DuMo simultaneously predicts velocity v_t and flow‑map u_t from a single input x_t. This applies geometric constraints from the multi‑step objective to every sample, bounding the few‑step estimation without separating training objectives, thereby significantly improving stability and efficiency. On ImageNet 256 × 256, a 679M Diffusion Transformer with SD‑VAE achieves a state‑of‑the‑art (SOTA) FID of 1.79 in just 2 steps. Code is available at: https://github.com/LINs‑lab/DuMo

Authors:Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen
Title: Sink-Aware Pruning for Diffusion Language Models
Abstract:
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention‑sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose \bf \textttSink‑Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality‑efficiency trade‑off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA‑Lab/Sink‑Aware‑Pruning.

Authors:Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
Title: Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Abstract:
Black‑box adversarial attacks on Large Vision‑Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state‑of‑the‑art transfer‑based approaches like M‑Attack perform well using local crop‑level matching between source and target images, we find this induces high‑variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike‑like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient‑denoising upgrade to M‑Attack. On the source side, Multi‑Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower‑variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch‑size ensemble (PE+), this strengthens transferable directions. Together these modules form M‑Attack‑V2, a simple, modular enhancement over M‑Attack that substantially improves transfer‑based black‑box attacks on frontier LVLMs: boosting success rates on Claude‑4.0 from 8% to 30%, Gemini‑2.5‑Pro from 83% to 97%, and GPT‑5 from 98% to 100%, outperforming prior black‑box LVLM attacks. Code and data are publicly available at: https://github.com/vila‑lab/M‑Attack‑V2.

Authors:Yaoyue Zheng, Yin Zhang, Joost van de Weijer, Gido M van de Ven, Shaoyi Du, Xuetao Zhang, Zhiqiang Tian
Title: Revisiting Weight Regularization for Low-Rank Continual Learning
Abstract:
Continual Learning (CL) with large‑scale pre‑trained models (PTMs) has recently gained wide attention, shifting the focus from training from scratch to continually adapting PTMs. This has given rise to a promising paradigm: parameter‑efficient continual learning (PECL), where task interference is typically mitigated by assigning a task‑specific module during training, such as low‑rank adapters. However, weight regularization techniques, such as Elastic Weight Consolidation (EWC)‑a key strategy in CL‑remain underexplored in this new paradigm. In this paper, we revisit weight regularization in low‑rank CL as a new perspective for mitigating task interference in PECL. Unlike existing low‑rank CL methods, we mitigate task interference by regularizing a shared low‑rank update through EWC, thereby keeping the storage requirement and inference costs constant regardless of the number of tasks. Our proposed method EWC‑LoRA leverages a low‑rank representation to estimate parameter importance over the full‑dimensional space. This design offers a practical, computational‑ and memory‑efficient solution for CL with PTMs, and provides insights that may inform the broader application of regularization techniques within PECL. Extensive experiments on various benchmarks demonstrate the effectiveness of EWC‑LoRA, achieving a stability‑plasticity trade‑off superior to existing low‑rank CL approaches. These results indicate that, even under low‑rank parameterizations, weight regularization remains an effective mechanism for mitigating task interference. Code is available at: https://github.com/yaoyz96/low‑rank‑cl.

Authors:Masahiro Kato
Title: genriesz: A Python Package for Automatic Debiased Machine Learning with Generalized Riesz Regression
Abstract:
Efficient estimation of causal and structural parameters can be automated using the Riesz representation theorem and debiased machine learning (DML). We present genriesz, an open‑source Python package that implements automatic DML and generalized Riesz regression, a unified framework for estimating Riesz representers by minimizing empirical Bregman divergences. This framework includes covariate balancing, nearest‑neighbor matching, calibrated estimation, and density ratio estimation as special cases. A key design principle of the package is automatic regressor balancing (ARB): given a Bregman generator g and a representer model class, genriesz automatically constructs a compatible link function so that the generalized Riesz regression estimator satisfies balancing (moment‑matching) optimality conditions in a user‑chosen basis. The package provides a modulr interface for specifying (i) the target linear functional via a black‑box evaluation oracle, (ii) the representer model via basis functions (polynomial, RKHS approximations, random forest leaf encodings, neural embeddings, and a nearest‑neighbor catchment basis), and (iii) the Bregman generator, with optional user‑supplied derivatives. It returns regression adjustment (RA), Riesz weighting (RW), augmented Riesz weighting (ARW), and TMLE‑style estimators with cross‑fitting, confidence intervals, and p‑values. We highlight representative workflows for estimation problems such as the average treatment effect (ATE), ATE on treated (ATT), and average marginal effect estimation. The Python package is available at https://github.com/MasaKat0/genriesz and on PyPI.

Authors:Peter Balogh
Title: The Anxiety of Influence: Bloom Filters in Transformer Attention Heads
Abstract:
Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT‑2 small, medium, and large; Pythia‑160M) and show that they form a spectrum of membership‑testing strategies. Two heads (L0H1 and L0H5 in GPT‑2 small) function as high‑precision membership filters with false positive rates of 0‑4% even at 180 unique context tokens ‑‑ well above the d_\texthead = 64 bit capacity of a classical Bloom filter. A third head (L1H11) shows the classic Bloom filter capacity curve: its false positive rate follows the theoretical formula p \approx (1 ‑ e^‑kn/m)^k with R^2 = 1.0 and fitted capacity m \approx 5 bits, saturating by n \approx 20 unique tokens. A fourth head initially identified as a Bloom filter (L3H0) was reclassified as a general prefix‑attention head after confound controls revealed its apparent capacity curve was a sequence‑length artifact. Together, the three genuine membership‑testing heads form a multi‑resolution system concentrated in early layers (0‑1), taxonomically distinct from induction and previous‑token heads, with false positive rates that decay monotonically with embedding distance ‑‑ consistent with distance‑sensitive Bloom filters. These heads generalize broadly: they respond to any repeated token type, not just repeated names, with 43% higher generalization than duplicate‑token‑only heads. Ablation reveals these heads contribute to both repeated and novel token processing, indicating that membership testing coexists with broader computational roles. The reclassification of L3H0 through confound controls strengthens rather than weakens the case: the surviving heads withstand the scrutiny that eliminated a false positive in our own analysis.

Authors:Gurjeet Sangra Singh, Frantzeska Lavda, Giangiacomo Mercatali, Alexandros Kalousis
Title: Variational Grey-Box Dynamics Matching
Abstract:
Deep generative models such as flow matching and diffusion models have shown great potential in learning complex distributions and dynamical systems, but often act as black‑boxes, neglecting underlying physics. In contrast, physics‑based simulation models described by ODEs/PDEs remain interpretable, but may have missing or unknown terms, unable to fully describe real‑world observations. We bridge this gap with a novel grey‑box method that integrates incomplete physics models directly into generative models. Our approach learns dynamics from observational trajectories alone, without ground‑truth physics parameters, in a simulation‑free manner that avoids scalability and stability issues of Neural ODEs. The core of our method lies in modelling a structured variational distribution within the flow matching framework, by using two latent encodings: one to model the missing stochasticity and multi‑modal velocity, and a second to encode physics parameters as a latent variable with a physics‑informed prior. Furthermore, we present an adaptation of the framework to handle second‑order dynamics. Our experiments on representative ODE/PDE problems show that our method performs on par with or superior to fully data‑driven approaches and previous grey‑box baselines, while preserving the interpretability of the physics model. Our code is available at https://github.com/DMML‑Geneva/VGB‑DM.

Authors:Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik
Title: Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study
Abstract:
Uncertainty quantification has emerged as an effective approach to closed‑book hallucination detection for LLMs, but existing methods are largely designed for short‑form outputs and do not generalize well to long‑form generation. We introduce a taxonomy for fine‑grained uncertainty quantification in long‑form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit‑level scoring, and response‑level aggregation. We formalize several families of consistency‑based black‑box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim‑response entailment consistently performs better or on par with more complex claim‑level scorers, 2) claim‑level scoring generally yields better results than sentence‑level scoring, and 3) uncertainty‑aware decoding is highly effective for improving the factuality of long‑form outputs. Our framework clarifies relationships between prior methods, enables apples‑to‑apples comparisons, and provides practical guidance for selecting components for fine‑grained UQ.

Authors:Lorenzo Caselli, Marco Mistretta, Simone Magistri, Andrew D. Bagdanov
Title: SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
Abstract:
Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross‑modal image‑concept similarities as a unified cross‑modal representation. Each image is expressed as a mixture over semantic concepts from a large task‑agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross‑modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross‑modal representations of the student remain both semantically sufficient and well‑aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state‑of‑the‑art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.

Authors:Luzhi Wang, Xuanshuo Fu, He Zhang, Chuang Liu, Xiaobao Wang, Hongbo Liu
Title: From Subtle to Significant: Prompt-Driven Self-Improving Optimization in Test-Time Graph OOD Detection
Abstract:
Graph Out‑of‑Distribution (OOD) detection aims to identify whether a test graph deviates from the distribution of graphs observed during training, which is critical for ensuring the reliability of Graph Neural Networks (GNNs) when deployed in open‑world scenarios. Recent advances in graph OOD detection have focused on test‑time training techniques that facilitate OOD detection without accessing potential supervisory information (e.g., training data). However, most of these methods employ a one‑pass inference paradigm, which prevents them from progressively correcting erroneous predictions to amplify OOD signals. To this end, we propose a Self‑Improving Graph Out‑of‑Distribution detector (SIGOOD), which is an unsupervised framework that integrates continuous self‑learning with test‑time training for effective graph OOD detection. Specifically, SIGOOD generates a prompt to construct a prompt‑enhanced graph that amplifies potential OOD signals. To optimize prompts, SIGOOD introduces an Energy Preference Optimization (EPO) loss, which leverages energy variations between the original test graph and the prompt‑enhanced graph. By iteratively optimizing the prompt by involving it into the detection model in a self‑improving loop, the resulting optimal prompt‑enhanced graph is ultimately used for OOD detection. Comprehensive evaluations on 21 real‑world datasets confirm the effectiveness and outperformance of our SIGOOD method. The code is at https://github.com/Ee1s/SIGOOD.

Authors:Ron Shapira Weber, Oren Freifeld
Title: SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch
Abstract:
We present softdtw‑cuda‑torch, an open‑source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence‑length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti‑diagonal kernel execution that removes the sequence‑length constraint, (2) a log‑space back‑ward pass that prevents floating‑point overflow, and (3) a fused distance‑computation mode that eliminates the O(BN M ) intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft‑DTW Barycenter computation. Code is available at https://github.com/BGU‑CS‑VIL/sdtw‑cuda‑torch.

Authors:Mingzhe Cui, Tao Chen, Yang Jiao, Yiqin Wang, Lei Xie, Yi Pan, Luca Mainardi
Title: BrainRVQ: A High-Fidelity EEG Foundation Model via Dual-Domain Residual Quantization and Hierarchical Autoregression
Abstract:
Developing foundation models for electroencephalography (EEG) remains challenging due to the signal's low signal‑to‑noise ratio and complex spectro‑temporal non‑stationarity. Existing approaches often overlook the hierarchical latent structure inherent in neural dynamics, leading to suboptimal reconstruction of fine‑grained information. In this work, we propose BrainRVQ, a general‑purpose EEG foundation model pre‑trained on a large‑scale corpus of clinical EEG data. Unlike standard masked modeling, BrainRVQ features a Dual‑Domain Residual Vector Quantization (DD‑RVQ) tokenizer that disentangles temporal waveforms and spectral patterns into hierarchical discrete codes. We further introduce a hierarchical autoregressive pre‑training objective that learns to reconstruct these codes in a coarse‑to‑fine manner, utilizing an importance‑guided curriculum masking strategy to prioritize information‑rich neural events over background noise. Extensive experiments across 8 diverse downstream datasets demonstrate that BrainRVQ consistently outperforms state‑of‑the‑art baselines, validating its effectiveness in learning robust and generalizable neural representations. Our code and model weights are available:https://github.com/keqicmz/BrainRVQ

Authors:Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj
Title: MALLVI: a multi agent framework for integrated generalized robotics manipulation
Abstract:
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings.We present MALLVi, a Multi Agent Large Language and Vision framework that enables closed loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step.Rather than using a single model, MALLVi coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning.Experiments in simulation and real world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks.Code available at https://github.com/iman1234ahmadi/MALLVI.

Authors:Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi, Sathwik Acharya, Xingyao Wang, Carolyn Rose, Graham Neubig, Daniel Fried
Title: Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
Abstract:
When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE‑Bench. In contrast, in real use, these agents solve more various and complex tasks that involve other skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize some transferable skills that are shared across diverse tasks by decomposing trajectories into fine‑grained components, and derive a set of principles for designing auxiliary training tasks to teach language models these skills. Guided by these principles, we propose a training environment, Hybrid‑Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real‑world tasks that are not present in training, improving a base model by 25.4% absolute gain on SWE‑Bench Verified, 7.9% on SWT‑Bench Verified, and 5.1% on Commit‑0 Lite. Hybrid‑Gym also complements datasets built for the downstream tasks (e.g., improving SWE‑Play by 4.9% on SWT‑Bench Verified). Code available at: https://github.com/yiqingxyq/Hybrid‑Gym.

Authors:Pengqi Liu, Zijun Yu, Mouloud Belbahri, Arthur Charpentier, Masoud Asgharian, Jesse C. Cresswell
Title: Beyond Procedure: Substantive Fairness in Conformal Prediction
Abstract:
Conformal prediction (CP) offers distribution‑free uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision‑making remains underexplored. Moving beyond CP as a standalone operation (procedural fairness), we analyze the holistic decision‑making pipeline to evaluate substantive fairness‑the equity of downstream outcomes. Theoretically, we derive an upper bound that decomposes prediction‑set size disparity into interpretable components, clarifying how label‑clustered CP helps control method‑driven contributions to unfairness. To facilitate scalable empirical analysis, we introduce an LLM‑in‑the‑loop evaluator that approximates human assessment of substantive fairness across diverse modalities. Our experiments reveal that label‑clustered CP variants consistently deliver superior substantive fairness. Finally, we empirically show that equalized set sizes, rather than coverage, strongly correlate with improved substantive fairness, enabling practitioners to design more fair CP systems. Our code is available at https://github.com/layer6ai‑labs/llm‑in‑the‑loop‑conformal‑fairness.

Authors:Xidong Wang, Shuqi Guo, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, Benyou Wang
Title: LiveClin: A Live Clinical Benchmark without Leakage
Abstract:
The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real‑world clinical practice. Built from contemporary, peer‑reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI‑human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real‑world scenarios, with the top‑performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human experts, Chief Physicians achieved the highest accuracy, followed closely by Attending Physicians, with both surpassing most models. LiveClin thus provides a continuously evolving, clinically grounded framework to guide the development of medical LLMs towards closing this gap and achieving greater reliability and real‑world utility. Our data and code are publicly available at https://github.com/AQ‑MedAI/LiveClin.

Authors:Zhangyi Liu, Huaizhi Qu, Xiaowei Yin, He Sun, Yanjun Han, Tianlong Chen, Zhun Deng
Title: PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency
Abstract:
Test‑time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample‑efficient test‑time self‑consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test‑TimeSelf‑Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self‑consistency rate, a new measure defined as agreement with the infinite‑budget majority vote. This formulation makes sample‑efficient test‑time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well‑developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority‑voting‑based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self‑consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation. Code is available at https://github.com/ZDCSlab/PETS.

Authors:Haoxiang Sun, Lizhen Xu, Bing Zhao, Wotao Yin, Wei Wang, Boyu Yang, Rui Wang, Hu Wei
Title: DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has been shown effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small‑scale manual construction or recombination of prior resources, which limits data diversity and coverage, thereby constraining further gains in model performance. To this end, we introduce DeepVision‑103K, a comprehensive dataset for RLVR training that covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks, and generalize effectively to general multimodal reasoning tasks. Further analysis reveals enhanced visual perception, reflection and reasoning capabilities in trained models, validating DeepVision's effectiveness for advancing multimodal reasoning. Data: \hrefhttps://huggingface.co/datasets/skylenage/DeepVision‑103Kthis url.

Authors:Karan Bali, Jack Stanley, Praneet Suresh, Danilo Bzdok
Title: Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
Abstract:
In mechanistic interpretability, recent work scrutinizes transformer "circuits" ‑ sparse, mono or multi layer sub computations, that may reflect human understandable functions. Yet, these network circuits are rarely acid‑tested for their stability across different instances of the same deep learning architecture. Without this, it remains unclear whether reported circuits emerge universally across labs or turn out to be idiosyncratic to a particular estimation instance, potentially limiting confidence in safety‑critical settings. Here, we systematically study stability across‑refits in increasingly complex transformer language models of various sizes. We quantify, layer by layer, how similarly attention heads learn representations across independently initialized training runs. Our rigorous experiments show that (1) middle‑layer heads are the least stable yet the most representationally distinct; (2) deeper models exhibit stronger mid‑depth divergence; (3) unstable heads in deeper layers become more functionally important than their peers from the same layer; (4) applying weight decay optimization substantially improves attention‑head stability across random model initializations; and (5) the residual stream is comparatively stable. Our findings establish the cross‑instance robustness of circuits as an essential yet underappreciated prerequisite for scalable oversight, drawing contours around possible white‑box monitorability of AI systems.

Authors:SungJun Cho, Chetan Gohil, Rukuang Huang, Oiwi Parker Jones, Mark W. Woolrich
Title: A Systematic Evaluation of Sample-Level Tokenization Strategies for MEG Foundation Models
Abstract:
Recent success in natural language processing has motivated growing interest in large‑scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as 'tokenization'. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample‑level tokenization strategies for transformer‑based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non‑learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject‑specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non‑learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple fixed sample‑level tokenization strategies can be used in the development of neural foundation models. The code is available at https://github.com/OHBA‑analysis/Cho2026_Tokenizer.

Authors:Qi You, Yitai Cheng, Zichao Zeng, James Haworth
Title: A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Abstract:
Street‑view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high‑definition map construction. It remains computationally demanding whether training from scratch, initialising from pre‑trained weights, or fine‑tuning large models. Although pre‑trained vision‑language models such as CLIP offer rich image representations, existing adaptation or fine‑tuning methods often rely on their global image embeddings, limiting their ability to capture fine‑grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP‑MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi‑head self‑attention operating on patch tokens to model inter‑patch dependencies. With approximately 1.4 million trainable parameters, CLIP‑MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state‑of‑the‑art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP‑MHAdapter.

Authors:Kaiting Liu, Hazel Doughty
Title: Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
Abstract:
Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero‑shot editing method that leverages the latent compositional structure of video classifiers to expose fine‑grained distinctions without additional data. We further show that low‑shot fine‑tuning, while simple, is highly effective and benefits from our zero‑shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision‑language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category‑Splitting/.

Authors:Tiou Wang, Zhuoqian Yang, Markus Flierl, Mathieu Salzmann, Sabine Süsstrunk
Title: Subtractive Modulative Network with Learnable Periodic Activations
Abstract:
We propose the Subtractive Modulative Network (SMN), a novel, parameter‑efficient Implicit Neural Representation (INR) architecture inspired by classical subtractive synthesis. The SMN is designed as a principled signal processing pipeline, featuring a learnable periodic activation layer (Oscillator) that generates a multi‑frequency basis, and a series of modulative mask modules (Filters) that actively generate high‑order harmonics. We provide both theoretical analysis and empirical validation for our design. Our SMN achieves a PSNR of 40+ dB on two image datasets, comparing favorably against state‑of‑the‑art methods in terms of both reconstruction accuracy and parameter efficiency. Furthermore, consistent advantage is observed on the challenging 3D NeRF novel view synthesis task. Supplementary materials are available at https://inrainbws.github.io/smn/.

Authors:Guy Bar-Shalom, Ami Tavory, Itay Evron, Maya Bechler-Speicher, Ido Guy, Haggai Maron
Title: A Graph Meta-Network for Learning on Kolmogorov-Arnold Networks
Abstract:
Weight‑space models learn directly from the parameters of neural networks, enabling tasks such as predicting their accuracy on new datasets. Naive methods ‑‑ like applying MLPs to flattened parameters ‑‑ perform poorly, making the design of better weight‑space architectures a central challenge. While prior work leveraged permutation symmetries in standard networks to guide such designs, no analogous analysis or tailored architecture yet exists for Kolmogorov‑Arnold Networks (KANs). In this work, we show that KANs share the same permutation symmetries as MLPs, and propose the KAN‑graph, a graph representation of their computation. Building on this, we develop WS‑KAN, the first weight‑space architecture that learns on KANs, which naturally accounts for their symmetry. We analyze WS‑KAN's expressive power, showing it can replicate an input KAN's forward pass ‑ a standard approach for assessing expressiveness in weight‑space architectures. We construct a comprehensive ``zoo'' of trained KANs spanning diverse tasks, which we use as benchmarks to empirically evaluate WS‑KAN. Across all tasks, WS‑KAN consistently outperforms structure‑agnostic baselines, often by a substantial margin. Our code is available at https://github.com/BarSGuy/KAN‑Graph‑Metanetwork.

Authors:Xu Zhang, Peng Wang, Yichen Li, Wei Wang
Title: Amortized Predictability-aware Training Framework for Time Series Forecasting and Classification
Abstract:
Time series data are prone to noise in various domains, and training samples may contain low‑predictability patterns that deviate from the normal data distribution, leading to training instability or convergence to poor local minima. Therefore, mitigating the adverse effects of low‑predictability samples is crucial for time series analysis tasks such as time series forecasting (TSF) and time series classification (TSC). While many deep learning models have achieved promising performance, few consider how to identify and penalize low‑predictability samples to improve model performance from the training perspective. To fill this gap, we propose a general Amortized Predictability‑aware Training Framework (APTF) for both TSF and TSC. APTF introduces two key designs that enable the model to focus on high‑predictability samples while still learning appropriately from low‑predictability ones: (i) a Hierarchical Predictability‑aware Loss (HPL) that dynamically identifies low‑predictability samples and progressively expands their loss penalty as training evolves, and (ii) an amortization model that mitigates predictability estimation errors caused by model bias, further enhancing HPL's effectiveness. The code is available at https://github.com/Meteor‑Stars/APTF.

Authors:Xu Zhang, Qitong Wang, Peng Wang, Wei Wang
Title: SEMixer: Semantics Enhanced MLP-Mixer for Multiscale Mixing and Long-term Time Series Forecasting
Abstract:
Modeling multiscale patterns is crucial for long‑term time series forecasting (TSF). However, redundancy and noise in time series, together with semantic gaps between non‑adjacent scales, make the efficient alignment and integration of multi‑scale temporal dependencies challenging. To address this, we propose SEMixer, a lightweight multiscale model designed for long‑term TSF. SEMixer features two key components: a Random Attention Mechanism (RAM) and a Multiscale Progressive Mixing Chain (MPMC). RAM captures diverse time‑patch interactions during training and aggregates them via dropout ensemble at inference, enhancing patch‑level semantics and enabling MLP‑Mixer to better model multi‑scale dependencies. MPMC further stacks RAM and MLP‑Mixer in a memory‑efficient manner, achieving more effective temporal mixing. It addresses semantic gaps across scales and facilitates better multiscale modeling and forecasting performance. We not only validate the effectiveness of SEMixer on 10 public datasets, but also on the 2025 CCF AlOps Challenge based on 21GB real wireless network data, where SEMixer achieves third place. The code is available at the link https://github.com/Meteor‑Stars/SEMixer.

Authors:Filippos Bellos, NaveenJohn Premkumar, Yannis Avrithis, Nam H. Nguyen, Jason J. Corso
Title: Deep TPC: Temporal-Prior Conditioning for Time Series Forecasting
Abstract:
LLM‑for‑time series (TS) methods typically treat time shallowly, injecting positional or prompt‑based cues once at the input of a largely frozen decoder, which limits temporal reasoning as this information degrades through the layers. We introduce Temporal‑Prior Conditioning (TPC), which elevates time to a first‑class modality that conditions the model at multiple depths. TPC attaches a small set of learnable time series tokens to the patch stream; at selected layers these tokens cross‑attend to temporal embeddings derived from compact, human‑readable temporal descriptors encoded by the same frozen LLM, then feed temporal context back via self‑attention. This disentangles time series signal and temporal information while maintaining a low parameter budget. We show that by training only the cross‑attention modules and explicitly disentangling time series signal and temporal information, TPC consistently outperforms both full fine‑tuning and shallow conditioning strategies, achieving state‑of‑the‑art performance in long‑term forecasting across diverse datasets. Code available at: https://github.com/fil‑mp/Deep_tpc

Authors:Idil Bilge Altun, Mert Onur Cakiroglu, Elham Buxton, Mehmet Dalkilic, Hasan Kurban
Title: LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization
Abstract:
Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent‑space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade‑off: vector‑quantized tokenizers learn flexible geometries but often suffer from biased straight‑through optimization, codebook under‑utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near‑complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end‑to‑end. LGQ replaces hard nearest‑neighbor lookup with temperature‑controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free‑energy objective, provably converging to nearest‑neighbor quantization in the low‑temperature limit. LGQ combines a token‑level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN‑style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization. At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: https://github.com/KurbanIntelligenceLab/LGQ

Authors:Nelson Salazar-Pena, Alejandra Tabares, Andres Gonzalez-Mancera
Title: MARLEM: A Multi-Agent Reinforcement Learning Simulation Framework for Implicit Cooperation in Decentralized Local Energy Markets
Abstract:
This paper introduces a novel, open‑source MARL simulation framework for studying implicit cooperation in LEMs, modeled as a decentralized partially observable Markov decision process and implemented as a Gymnasium environment for MARL. Our framework features a modular market platform with plug‑and‑play clearing mechanisms, physically constrained agent models (including battery storage), a realistic grid network, and a comprehensive analytics suite to evaluate emergent coordination. The main contribution is a novel method to foster implicit cooperation, where agents' observations and rewards are enhanced with system‑level key performance indicators to enable them to independently learn strategies that benefit the entire system and aim for collectively beneficial outcomes without explicit communication. Through representative case studies (available in a dedicated GitHub repository in https://github.com/salazarna/marlem, we show the framework's ability to analyze how different market configurations (such as varying storage deployment) impact system performance. This illustrates its potential to facilitate emergent coordination, improve market efficiency, and strengthen grid stability. The proposed simulation framework is a flexible, extensible, and reproducible tool for researchers and practitioners to design, test, and validate strategies for future intelligent, decentralized energy systems.

Authors:KC Santosh, Srikanth Baride, Rodrigue Rizk
Title: AI-CARE: Carbon-Aware Reporting Evaluation Metric for AI Models
Abstract:
As machine learning (ML) continues its rapid expansion, the environmental cost of model training and inference has become a critical societal concern. Existing benchmarks overwhelmingly focus on standard performance metrics such as accuracy, BLEU, or mAP, while largely ignoring energy consumption and carbon emissions. This single‑objective evaluation paradigm is increasingly misaligned with the practical requirements of large‑scale deployment, particularly in energy‑constrained environments such as mobile devices, developing regions, and climate‑aware enterprises. In this paper, we propose AI‑CARE, an evaluation tool for reporting energy consumption, and carbon emissions of ML models. In addition, we introduce the carbon‑performance tradeoff curve, an interpretable tool that visualizes the Pareto frontier between performance and carbon cost. We demonstrate, through theoretical analysis and empirical validation on representative ML workloads, that carbon‑aware benchmarking changes the relative ranking of models and encourages architectures that are simultaneously accurate and environmentally responsible. Our proposal aims to shift the research community toward transparent, multi‑objective evaluation and align ML progress with global sustainability goals. The tool and documentation are available at https://github.com/USD‑AI‑ResearchLab/ai‑care.

Authors:Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, Kenneth Enevoldsen
Title: MAEB: Massive Audio Embedding Benchmark
Abstract:
We introduce the Massive Audio Embedding Benchmark (MAEB), a large‑scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross‑modal audio‑text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio‑text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB‑FLEURS), while speech‑pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best‑performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings‑benchmark/mteb.

Authors:Junbo Jacob Lian, Yujun Sun, Huiling Chen, Chaoyu Zhang, Chung-Piaw Teo
Title: ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization
Abstract:
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver‑feasible solutions may encode semantically incorrect formulations, creating a feasibility‑correctness gap of up to 90 percentage points on compositional problems. We introduce ReLoop, addressing silent failures from two complementary directions. Structured generation decomposes code production into a four‑stage reasoning chain (understand, formalize, synthesize, verify) that mirrors expert modeling practice, with explicit variable‑type reasoning and self‑verification to prevent formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver‑based parameter perturbation, without requiring ground truth ‑‑ an external semantic signal that bypasses the self‑consistency problem inherent in LLM‑based code review. The two mechanisms are complementary: structured generation dominates on complex compositional problems, while behavioral verification becomes the largest single contributor on problems with localized formulation defects. Together with execution recovery via IIS‑enhanced diagnostics, ReLoop raises correctness from 22.6% to 31.1% and execution from 72.1% to 100.0% on the strongest model, with consistent gains across five models spanning three paradigms (foundation, SFT, RL) and three benchmarks. We additionally release RetailOpt‑190, 190 compositional retail optimization scenarios targeting the multi‑constraint interactions where LLMs most frequently fail.

Authors:Christian Schlarmann, Matthias Hein
Title: Visual Memory Injection Attacks for Multi-Turn Conversations
Abstract:
Generative large vision‑language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long‑context multi‑turn setting, is largely underexplored. In this paper, we consider the realistic scenario in which an attacker uploads a manipulated image to the web/social media. A benign user downloads this image and uses it as input to the LVLM. Our novel stealthy Visual Memory Injection (VMI) attack is designed such that on normal prompts the LVLM exhibits nominal behavior, but once the user gives a triggering prompt, the LVLM outputs a specific prescribed target message to manipulate the user, e.g. for adversarial marketing or political persuasion. Compared to previous work that focused on single‑turn attacks, VMI is effective even after a long multi‑turn conversation with the user. We demonstrate our attack on several recent open‑weight LVLMs. This article thereby shows that large‑scale manipulation of users is feasible with perturbed images in multi‑turn conversation settings, calling for better robustness of LVLMs against these attacks. We release the source code at https://github.com/chs20/visual‑memory‑injection

Authors:Kaaustaaub Shankar, Kelly Cohen
Title: Genetic Generalized Additive Models
Abstract:
Generalized Additive Models (GAMs) balance predictive accuracy and interpretability, but manually configuring their structure is challenging. We propose using the multi‑objective genetic algorithm NSGA‑II to automatically optimize GAMs, jointly minimizing prediction error (RMSE) and a Complexity Penalty that captures sparsity, smoothness, and uncertainty. Experiments on the California Housing dataset show that NSGA‑II discovers GAMs that outperform baseline LinearGAMs in accuracy or match performance with substantially lower complexity. The resulting models are simpler, smoother, and exhibit narrower confidence intervals, enhancing interpretability. This framework provides a general approach for automated optimization of transparent, high‑performing models. The code can be found at https://github.com/KaaustaaubShankar/GeneticAdditiveModels.

Authors:5 Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong, Mingdao Liu, Mingming Zhao, Pengfan Du, Qian Dong, Rui Lu, Shuang-Li, Shulin Cao, Song Liu, Ting Jiang, Xiaodong Chen, Xiaohan Zhang, Xuancheng Huang, Xuezhen Dong, Yabo Xu, Yao Wei, Yifan An, Yilin Niu, Yitong Zhu, Yuanhao Wen, Yukuo Cen, Yushi Bai, Zhongpei Qiao, Zihan Wang, Zikang Wang, Zilin Zhu, Ziqiang Liu, Zixuan Li, Bojie Wang, Bosi Wen, Can Huang, Changpeng Cai, Chao Yu, Chen Li, Chen Li, Chenghua Huang, Chengwei Hu, Chenhui Zhang, Chenzheng Zhu, Congfeng Yin, Daoyan Lin, Dayong Yang, Di Wang, Ding Ai, Erle Zhu, Fangzhou Yi, Feiyu Chen, Guohong Wen, Hailong Sun, Haisha Zhao, Haiyi Hu, Hanchen Zhang, Hanrui Liu, Hanyu Zhang, Hao Peng, Hao Tai, Haobo Zhang, He Liu, Hongwei Wang, Hongxi Yan, Hongyu Ge, Huan Liu, Huan Liu, Huanpeng Chu, Jia'ni Zhao, Jiachen Wang, Jiajing Zhao, Jiamin Ren, Jiapeng Wang, Jiaxin Zhang, Jiayi Gui, Jiayue Zhao, Jijie Li, Jing An, Jing Li, Jingwei Yuan, Jinhua Du, Jinxin Liu, Junkai Zhi, Junwen Duan, Kaiyue Zhou, Kangjian Wei, Ke Wang, Keyun Luo, Laiqiang Zhang, Leigang Sha, Liang Xu, Lindong Wu, Lintao Ding, Lu Chen, Minghao Li, Nianyi Lin, Pan Ta, Qiang Zou, Rongjun Song, Ruiqi Yang, Shangqing Tu, Shangtong Yang, Shaoxiang Wu, Shengyan Zhang, Shijie Li, Shuang Li, Shuyi Fan, Wei Qin, Wei Tian, Weining Zhang, Wenbo Yu, Wenjie Liang, Xiang Kuang, Xiangmeng Cheng, Xiangyang Li, Xiaoquan Yan, Xiaowei Hu, Xiaoying Ling, Xing Fan, Xingye Xia, Xinyuan Zhang, Xinze Zhang, Xirui Pan, Xunkai Zhang, Yandong Wu, Yanfu Li, Yidong Wang, Yifan Zhu, Yijun Tan, Yilin Zhou, Yiming Pan, Ying Zhang, Yinpei Su, Yipeng Geng, Yipeng Geng, Yong Yan, Yonglin Tan, Yuean Bi, Yuhan Shen, Yuhao Yang, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yurong Wu, Yutao Zhang, Yuxi Duan, Yuxuan Zhang, Zezhen Liu, Zhengtao Jiang, Zhenhe Yan, Zheyu Zhang, Zhixiang Wei, Zhuo Chen, Zhuoer Feng, Zijun Yao, Ziwei Chai, Ziyuan Wang, Zuzhou Zhang, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang
Title: GLM-5: from Vibe Coding to Agentic Engineering
Abstract:
We present GLM‑5, a next‑generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM‑5 adopts DSA to significantly reduce training and inference costs while maintaining long‑context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post‑training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long‑horizon interactions more effectively. Through these innovations, GLM‑5 achieves state‑of‑the‑art performance on major open benchmarks. Most critically, GLM‑5 demonstrates unprecedented capability in real‑world coding tasks, surpassing previous baselines in handling end‑to‑end software engineering challenges. Code, models, and more information are available at https://github.com/zai‑org/GLM‑5.

Authors:Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen, Jing Tang, Jianguo Li
Title: TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
Abstract:
Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine‑Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test‑driven and cApability‑adaptive cuRriculum reinfOrcement fine‑Tuning (TAROT). TAROT systematically constructs, for each problem, a four‑tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability‑conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test‑case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability, with less capable models achieving greater gains with an easy‑to‑hard progression, whereas more competent models excel under a hard‑first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep‑diver/TAROT.

Authors:Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao
Title: The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Abstract:
Multi‑Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high‑bandwidth alternative, existing approaches either assume homogeneous sender‑receiver architectures or rely on pair‑specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision‑Language Models (VLMs) to enable model‑agnostic, text‑free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter‑agent telepathy. Our framework adopts a hub‑and‑spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label‑free, teacher‑student distillation objective to align the high‑speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen‑VL, Gemma) demonstrate that the Vision Wormhole reduces end‑to‑end wall‑clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text‑based MAS. Code is available at https://github.com/xz‑liu/heterogeneous‑latent‑mas

Authors:Hao Chen, Zavareh Bozorgasl
Title: SCENE OTA-FD: Self-Centering Noncoherent Estimator for Over-the-Air Federated Distillation
Abstract:
We propose SCENE (Self‑Centering Noncoherent Estimator), a pilot‑free and phase‑invariant aggregation primitive for over‑the‑air federated distillation (OTA‑FD). Each device maps its soft‑label (class‑probability) vector to nonnegative transmit energies under constant per‑round power and constant‑envelope signaling (PAPR near 1). At the server, a self‑centering energy estimator removes the noise‑energy offset and yields an unbiased estimate of the weighted soft‑label average, with variance decaying on the order of 1/(SM) in the number of receive antennas M and repetition factor S. We also develop a pilot‑free ratio‑normalized variant that cancels unknown large‑scale gains, provide a convergence bound consistent with coherent OTA‑FD analyses, and present an overhead‑based crossover comparison. SCENE targets short‑coherence and hardware‑constrained regimes, where avoiding per‑round CSI is essential: it trades a modest noncoherent variance constant for zero uplink pilots, unbiased aggregation, and hardware‑friendly transmission, and can outperform coherent designs when pilot overhead is non‑negligible.

Authors:Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch
Title: The Information Geometry of Softmax: Probing and Steering
Abstract:
This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation of this paper is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off‑target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.

Authors:Muhammad J. Alahmadi, Peng Gao, Feiyi Wang, Dongkuan Xu
Title: Accelerating Large-Scale Dataset Distillation via Exploration-Exploitation Optimization
Abstract:
Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling‑based distillation methods enable dataset distillation at large scale, they continue to face an efficiency gap: optimization‑based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization‑free decoupling methods are efficient but sacrifice accuracy. To overcome this trade‑off, we propose Exploration‑‑Exploitation Distillation (E^2D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full‑image initialization to preserve semantic integrity and feature diversity. It then uses a two‑phase optimization strategy: an exploration phase that performs uniform updates and identifies high‑loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E^2D on large‑scale benchmarks, surpassing the state‑of‑the‑art on ImageNet‑1K while being 18× faster, and on ImageNet‑21K, our method substantially improves accuracy while remaining 4.3× faster. These results demonstrate that targeted, redundancy‑reducing updates, rather than brute‑force optimization, bridge the gap between accuracy and efficiency in large‑scale dataset distillation. Code is available at https://github.com/ncsu‑dk‑lab/E2D.

Authors:Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Baher Mohammad, Stamatios Lefkimmiatis
Title: COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression
Abstract:
Post‑training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union‑of‑subspaces representation, but existing approaches often suffer from iterative dictionary and coefficient updates. We propose COMPOT (Calibration‑Optimized Matrix Procrustes Orthogonalization for Transformers), a training‑free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT employs orthogonal dictionaries that enable closed‑form Procrustes updates for the dictionary and analytical single‑step sparse coding for the coefficients, eliminating iterative optimization. To handle heterogeneous layer sensitivity under a global compression budget, COMPOT further introduces a one‑shot dynamic allocation strategy that adaptively redistributes layer‑wise compression rates. Extensive experiments across diverse architectures and tasks show that COMPOT consistently delivers a superior quality‑compression trade‑off over strong low‑rank and sparse baselines, while remaining fully compatible with post‑training quantization for extreme compression. Code is available \hrefhttps://github.com/mts‑ai/COMPOThere.

Authors:Tianyu Xiong, Skylar Wurster, Han-Wei Shen
Title: Refine Now, Query Fast: A Decoupled Refinement Paradigm for Implicit Neural Fields
Abstract:
Implicit Neural Representations (INRs) have emerged as promising surrogates for large 3D scientific simulations due to their ability to continuously model spatial and conditional fields, yet they face a critical fidelity‑speed dilemma: deep MLPs suffer from high inference cost, while efficient embedding‑based models lack sufficient expressiveness. To resolve this, we propose the Decoupled Representation Refinement (DRR) architectural paradigm. DRR leverages a deep refiner network, alongside non‑parametric transformations, in a one‑time offline process to encode rich representations into a compact and efficient embedding structure. This approach decouples slow neural networks with high representational capacity from the fast inference path. We introduce DRR‑Net, a simple network that validates this paradigm, and a novel data augmentation strategy, Variational Pairs (VP) for improving INRs under complex tasks like high‑dimensional surrogate modeling. Experiments on several ensemble simulation datasets demonstrate that our approach achieves state‑of‑the‑art fidelity, while being up to 27× faster at inference than high‑fidelity baselines and remaining competitive with the fastest models. The DRR paradigm offers an effective strategy for building powerful and practical neural field surrogates and \revINRs in broader applications, with a minimal compromise between speed and quality.

Authors:Per Åhag, Alexander Friedrich, Fredrik Ohlsson, Viktor Vigren Näslund
Title: PolyNODE: Variable-dimension Neural ODEs on M-polyfolds
Abstract:
Neural ordinary differential equations (NODEs) are geometric deep learning models based on dynamical systems and flows generated by vector fields on manifolds. Despite numerous successful applications, particularly within the flow matching paradigm, all existing NODE models are fundamentally constrained to fixed‑dimensional dynamics by the intrinsic nature of the manifold's dimension. In this paper, we extend NODEs to M‑polyfolds (spaces that can simultaneously accommodate varying dimensions and a notion of differentiability) and introduce PolyNODEs, the first variable‑dimensional flow‑based model in geometric deep learning. As an example application, we construct explicit M‑polyfolds featuring dimensional bottlenecks and PolyNODE autoencoders based on parametrised vector fields that traverse these bottlenecks. We demonstrate experimentally that our PolyNODE models can be trained to solve reconstruction tasks in these spaces, and that latent representations of the input can be extracted and used to solve downstream classification tasks. The code used in our experiments is publicly available at https://github.com/turbotage/PolyNODE .

Authors:Lunjun Zhang, Ryan Chen, Bradly C. Stadie
Title: Evolutionary System Prompt Learning can Facilitate Reinforcement Learning for LLMs
Abstract:
Building agentic systems that can autonomously self‑improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self‑improve via two mechanisms: self‑reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E‑SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E‑SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM‑driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E‑SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy‑to‑hard (AIME \rightarrow BeyondAIME) generalization setting, E‑SPL improves RL success rate from 38.8% \rightarrow 45.1% while also outperforming reflective prompt evolution (40.0%). Overall, our results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization. Code: https://github.com/LunjunZhang/E‑SPL

Authors:Nihal V. Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis
Title: A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)
Abstract:
Instruction fine‑tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero‑shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient‑based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. While no single method dominates, gradient‑based representations paired with a greedy round‑robin selection algorithm tend to perform best on average at low budgets, but these benefits diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine‑tuning. The code is available at https://github.com/dcml‑lab/targeted‑instruction‑selection.

Authors:Adrián Javaloy, Antonio Vergari
Title: An Embarrassingly Simple Way to Optimize Orthogonal Matrices at Scale
Abstract:
Orthogonality constraints are ubiquitous in robust and probabilistic machine learning. Unfortunately, current optimizers are computationally expensive and do not scale to problems with hundreds or thousands of constraints. One notable exception is the Landing algorithm (Ablin et al., 2024) which, however comes at the expense of temporarily relaxing orthogonality. In this work, we revisit and improve on the ideas behind Landing, enabling the inclusion of modern adaptive optimizers while ensuring that orthogonal constraints are effectively met. Remarkably, these improvements come at little to no cost, and reduce the number of required hyperparemeters. Our algorithm POGO is fast and GPU‑friendly, consisting of only 5 matrix products, and in practice maintains orthogonality at all times. On several challenging benchmarks, POGO greatly outperforms recent optimizers and shows it can optimize problems with thousands of orthogonal matrices in minutes while alternatives would take hours. As such, POGO sets a milestone to finally exploit orthogonality constraints in ML at scale. A PyTorch implementation of POGO is publicly available at https://github.com/adrianjav/pogo.

Authors:Karim Galliamov, Syed M Ahsan Kazmi, Adil Khan, Adín Ramírez Rivera
Title: Concepts' Information Bottleneck Models
Abstract:
Concept Bottleneck Models (CBMs) aim to deliver interpretable predictions by routing decisions through a human‑understandable concept layer, yet they often suffer reduced accuracy and concept leakage that undermines faithfulness. We introduce an explicit Information Bottleneck regularizer on the concept layer that penalizes I(X;C) while preserving task‑relevant information in I(C;Y), encouraging minimal‑sufficient concept representations. We derive two practical variants (a variational objective and an entropy‑based surrogate) and integrate them into standard CBM training without architectural changes or additional supervision. Evaluated across six CBM families and three benchmarks, the IB‑regularized models consistently outperform their vanilla counterparts. Information‑plane analyses further corroborate the intended behavior. These results indicate that enforcing a minimal‑sufficient concept bottleneck improves both predictive performance and the reliability of concept‑level interventions. The proposed regularizer offers a theoretic‑grounded, architecture‑agnostic path to more faithful and intervenable CBMs, resolving prior evaluation inconsistencies by aligning training protocols and demonstrating robust gains across model families and datasets.

Authors:Erkan Karabulut, Daniel Daza, Paul Groth, Martijn C. Schut, Victoria Degeler
Title: Tabular Foundation Models Can Learn Association Rules
Abstract:
Association Rule Mining (ARM) is a fundamental task for knowledge discovery in tabular data and is widely used in high‑stakes decision‑making. Classical ARM methods rely on frequent itemset mining, leading to rule explosion and poor scalability, while recent neural approaches mitigate these issues but suffer from degraded performance in low‑data regimes. Tabular foundation models (TFMs), pretrained on diverse tabular data with strong in‑context generalization, provide a basis for addressing these limitations. We introduce a model‑agnostic association rule learning framework that extracts association rules from any conditional probabilistic model over tabular data, enabling us to leverage TFMs. We then introduce TabProbe, an instantiation of our framework that utilizes TFMs as conditional probability estimators to learn association rules out‑of‑the‑box without frequent itemset mining. We evaluate our approach on tabular datasets of varying sizes based on standard ARM rule quality metrics and downstream classification performance. The results show that TFMs consistently produce concise, high‑quality association rules with strong predictive performance and remain robust in low‑data settings without task‑specific training. Source code is available at https://github.com/DiTEC‑project/tabprobe.

Authors:Aswathi Varma, Suprosanna Shit, Chinmay Prabhakar, Daniel Scholz, Hongwei Bran Li, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler
Title: VariViT: A Vision Transformer for Variable Image Sizes
Abstract:
Vision Transformers (ViTs) have emerged as the state‑of‑the‑art architecture in representation learning, leveraging self‑attention mechanisms to excel in various tasks. ViTs split images into fixed‑size patches, constraining them to a predefined size and necessitating pre‑processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground‑to‑background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable‑sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation‑accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1‑scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi‑Varma/varivit

Authors:Tianyi Ma, Yiyang Li, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Chuxu Zhang, Yanfang Ye
Title: OPBench: A Graph Benchmark to Combat the Opioid Crisis
Abstract:
The opioid epidemic continues to ravage communities worldwide, straining healthcare systems, disrupting families, and demanding urgent computational solutions. To combat this lethal opioid crisis, graph learning methods have emerged as a promising paradigm for modeling complex drug‑related phenomena. However, a significant gap remains: there is no comprehensive benchmark for systematically evaluating these methods across real‑world opioid crisis scenarios. To bridge this gap, we introduce OPBench, the first comprehensive opioid benchmark comprising five datasets across three critical application domains: opioid overdose detection from healthcare claims, illicit drug trafficking detection from digital platforms, and drug misuse prediction from dietary patterns. Specifically, OPBench incorporates diverse graph structures, including heterogeneous graphs and hypergraphs, to preserve the rich and complex relational information among drug‑related data. To address data scarcity, we collaborate with domain experts and authoritative institutions to curate and annotate datasets while adhering to privacy and ethical guidelines. Furthermore, we establish a unified evaluation framework with standardized protocols, predefined data splits, and reproducible baselines to facilitate fair and systematic comparison among graph learning methods. Through extensive experiments, we analyze the strengths and limitations of existing graph learning methods, thereby providing actionable insights for future research in combating the opioid crisis. Our source code and datasets are available at https://github.com/Tianyi‑Billy‑Ma/OPBench.

Authors:Chaosheng Dong, Peiyao Xiao, Yijia Wang, Kaiyi Ji
Title: DeepMTL2R: A Library for Deep Multi-task Learning to Rank
Abstract:
This paper presents DeepMTL2R, an open‑source deep learning framework for Multi‑task Learning to Rank (MTL2R), where multiple relevance criteria must be optimized simultaneously. DeepMTL2R integrates heterogeneous relevance signals into a unified, context‑aware model by leveraging the self‑attention mechanism of transformer architectures, enabling effective learning across diverse and potentially conflicting objectives. The framework includes 21 state‑of‑the‑art multi‑task learning algorithms and supports multi‑objective optimization to identify Pareto‑optimal ranking models. By capturing complex dependencies and long‑range interactions among items and labels, DeepMTL2R provides a scalable and expressive solution for modern ranking systems and facilitates controlled comparisons across MTL strategies. We demonstrate its effectiveness on a publicly available dataset, report competitive performance, and visualize the resulting trade‑offs among objectives. DeepMTL2R is available at \hrefhttps://github.com/amazon‑science/DeepMTL2Rhttps://github.com/amazon‑science/DeepMTL2R.

Authors:Aryan Das, Tanishq Rachamalla, Koushik Biswas, Swalpa Kumar Roy, Vinay Kumar Verma
Title: Uncertainty-Aware Vision-Language Segmentation for Medical Imaging
Abstract:
We introduce a novel uncertainty‑aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross‑modal fusion and long‑range dependency modelling. To guide learning under ambiguity, we propose the Spectral‑Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. In complex clinical circumstances with poor image quality, this formulation improves model reliability. Extensive experiments on various publicly available medical datasets, QATA‑COVID19, MosMed++, and Kvasir‑SEG, demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State‑of‑the‑Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision‑language medical segmentation tasks. Code: https://github.com/arya‑domain/UA‑VLS

Authors:Alejandro Francisco Queiruga
Title: Divine Benevolence is an $x^2$: GLUs scale asymptotically faster than MLPs
Abstract:
Scaling laws can be understood from ground‑up numerical analysis, where traditional function approximation theory can explain shifts in model architecture choices. GLU variants now dominate frontier LLMs and similar outer‑product architectures are prevalent in ranking models. The success of these architectures has mostly been left as an empirical discovery. In this paper, we apply the tools of numerical analysis to expose a key factor: these models have an x^2 which enables \emphasymptotically faster scaling than MLPs. GLUs have piecewise quadratic functional forms that are sufficient to exhibit quadratic order of approximation. Our key contribution is to demonstrate that the L(P) scaling slope is L(P)\propto P^‑3 for GLUs but only L(P)=P^‑2 for MLPs on function reconstruction problems. We provide a parameter construction and empirical verification of these slopes for 1D function approximation. From the first principles we discover, we make one stride and propose the ``Gated Quadratic Unit'' which has an even steeper L(P) slope than the GLU and MLP. This opens the possibility of architecture design from first principles numerical theory to unlock superior scaling in large models. Replication code is available at https://github.com/afqueiruga/divine_scaling.

Authors:William L. Tong, Ege Cakar, Cengiz Pehlevan
Title: Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
Abstract:
Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and the limits of this paradigm, remain incomplete. To promote greater clarity, we introduce PITA: a novel large‑scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non‑RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Our findings overall identify fundamental benefits and limitations inherent in using reasoning traces.

Authors:Morgan Byrd, Donghoon Baek, Kartik Garg, Hyunyoung Jung, Daesol Cho, Maks Sorokin, Robert Wright, Sehoon Ha
Title: AdaptManip: Learning Adaptive Whole-Body Object Lifting and Delivery with Online Recurrent State Estimation
Abstract:
This paper presents Adaptive Whole‑body Loco‑Manipulation, AdaptManip, a fully autonomous framework for humanoid robots to perform integrated navigation, object lifting, and delivery. Unlike prior imitation learning‑based approaches that rely on human demonstrations and are often brittle to disturbances, AdaptManip aims to train a robust loco‑manipulation policy via reinforcement learning without human demonstrations or teleoperation data. The proposed framework consists of three coupled components: (1) a recurrent object state estimator that tracks the manipulated object in real time under limited field‑of‑view and occlusions; (2) a whole‑body base policy for robust locomotion with residual manipulation control for stable object lifting and delivery; and (3) a LiDAR‑based robot global position estimator that provides drift‑robust localization. All components are trained in simulation using reinforcement learning and deployed on real hardware in a zero‑shot manner. Experimental results show that AdaptManip significantly outperforms baseline methods, including imitation learning‑based approaches, in adaptability and overall success rate, while accurate object state estimation improves manipulation performance even under occlusion. We further demonstrate fully autonomous real‑world navigation, object lifting, and delivery on a humanoid robot.

Authors:Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart, Amir Feder
Title: STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts
Abstract:
Inference‑Time‑Compute (ITC) methods like Best‑of‑N and Tree‑of‑Thoughts are meant to produce output candidates that are both high‑quality and diverse, but their use of high‑temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe‑of‑Thoughts (STATe), an interpretable ITC method that searches over high‑level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high‑level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action‑guided textual interventions produce greater response diversity than temperature‑based sampling. Second, in a case study on argument generation, STATe's explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high‑quality, diverse, and interpretable text. Our framework is available at https://github.com/zbambergerNLP/state‑of‑thoughts.

Authors:Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Title: AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Abstract:
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real‑world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi‑round interaction with professional marketing tools. To address this gap, we propose AD‑Bench, a benchmark designed based on real‑world business requirements of advertising and marketing platforms. AD‑Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool‑call trajectories. The benchmark categorizes requests into three difficulty levels (L1‑L3) to evaluate agents' capabilities under multi‑round, multi‑tool collaboration. Experiments show that on AD‑Bench, Gemini‑3‑Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state‑of‑the‑art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD‑Bench provides a realistic benchmark for evaluating and improving advertising marketing agents, the leaderboard and code can be found at https://github.com/Emanual20/adbench‑leaderboard.

Authors:Yaxuan Kong, Hoyoung Lee, Yoontae Hwang, Alejandro Lopez-Lira, Bradford Levy, Dhagash Mehta, Qingsong Wen, Chanyeol Choi, Yongjae Lee, Stefan Zohren
Title: Evaluating LLMs in Finance Requires Explicit Bias Consideration
Abstract:
Large Language Models (LLMs) are increasingly integrated into financial workflows, but evaluation practice has not kept up. Finance‑specific biases can inflate performance, contaminate backtests, and make reported results useless for any deployment claim. We identify five recurring biases in financial LLM applications. They include look‑ahead bias, survivorship bias, narrative bias, objective bias, and cost bias. These biases break financial tasks in distinct ways and they often compound to create an illusion of validity. We reviewed 164 papers from 2023 to 2025 and found that no single bias is discussed in more than 28 percent of studies. This position paper argues that bias in financial LLM systems requires explicit attention and that structural validity should be enforced before any result is used to support a deployment claim. We propose a Structural Validity Framework and an evaluation checklist with minimal requirements for bias diagnosis and future system design. The material is available at https://github.com/Eleanorkong/Awesome‑Financial‑LLM‑Bias‑Mitigation.

Authors:Max Fomin
Title: When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
Abstract:
Detecting prompt injection and jailbreak attacks is critical for deploying LLM‑based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave‑One‑Dataset‑Out (LODO) evaluation to measure true out‑of‑distribution generalization, revealing that the standard practice of train‑test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per‑dataset gaps range from 1% to 25% accuracy‑exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto‑Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset‑dependent shortcuts whose class signal depends on specific dataset compositions rather than semantic content. We systematically compare production guardrails (PromptGuard 2, LlamaGuard) and LLM‑as‑judge approaches on our benchmark, finding all three fail on indirect attacks targeting agents (7‑37% detection) and that PromptGuard 2 and LlamaGuard cannot evaluate agentic tool injection due to architectural limitations. Finally, we show that LODO‑stable SAE features provide more reliable explanations for classifier decisions by filtering dataset artifacts. We release our evaluation framework at https://github.com/maxf‑zn/prompt‑mining to establish LODO as the appropriate protocol for prompt attack detection research.

Authors:Juntong Wang, Libin Chen, Xiyuan Wang, Shijia Kang, Haotong Yang, Da Zheng, Muhan Zhang
Title: GREPO: A Benchmark for Graph Neural Networks on Repository-Level Bug Localization
Abstract:
Repository‑level bug localization‑the task of identifying where code must be modified to fix a bug‑is a critical software engineering challenge. Standard Large Language Modles (LLMs) are often unsuitable for this task due to context window limitations that prevent them from processing entire code repositories. As a result, various retrieval methods are commonly used, including keyword matching, text similarity, and simple graph‑based heuristics such as Breadth‑First Search. Graph Neural Networks (GNNs) offer a promising alternative due to their ability to model complex, repository‑wide dependencies; however, their application has been hindered by the lack of a dedicated benchmark. To address this gap, we introduce GREPO, the first GNN benchmark for repository‑scale bug localization tasks. GREPO comprises 86 Python repositories and 47294 bug‑fixing tasks, providing graph‑based data structures ready for direct GNN processing. Our evaluation of various GNN architectures shows outstanding performance compared to established information retrieval baselines. This work highlights the potential of GNNs for bug localization and established GREPO as a foundation resource for future research, The code is available at https://github.com/qingpingmo/GREPO.

Authors:Xiaoyu Tao, Mingyue Cheng, Chuang Jiang, Tian Gao, Huanjian Zhang, Yaguo Liu
Title: Cast-R1: Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting
Abstract:
Time series forecasting has long been dominated by model‑centric approaches that formulate prediction as a single‑pass mapping from historical observations to future values. Despite recent progress, such formulations often struggle in complex and evolving settings, largely because most forecasting models lack the ability to autonomously acquire informative evidence, reason about potential future changes, or revise predictions through iterative decision processes. In this work, we propose Cast‑R1, a learned time series forecasting framework that reformulates forecasting as a sequential decision‑making problem. Cast‑R1 introduces a memory‑based state management mechanism that maintains decision‑relevant information across interaction steps, enabling the accumulation of contextual evidence to support long‑horizon reasoning. Building on this formulation, forecasting is carried out through a tool‑augmented agentic workflow, in which the agent autonomously interacts with a modular toolkit to extract statistical features, invoke lightweight forecasting models for decision support, perform reasoning‑based prediction, and iteratively refine forecasts through self‑reflection. To train Cast‑R1, we adopt a two‑stage learning strategy that combines supervised fine‑tuning with multi‑turn reinforcement learning, together with a curriculum learning scheme that progressively increases task difficulty to improve policy learning. Extensive experiments on multiple real‑world time series datasets demonstrate the effectiveness of Cast‑R1. We hope this work provides a practical step towards further exploration of agentic paradigms for time series modeling. Our code is available at https://github.com/Xiaoyu‑Tao/Cast‑R1‑TS.

Authors:Youwei Shu, Shaomian Zheng, Dingnan Jin, Wenjie Qu, Ziyao Guo, Qing Cui, Jun Zhou, Jiaheng Zhang
Title: On Representation Redundancy in Large-Scale Instruction Tuning Data Selection
Abstract:
Data quality is a crucial factor in large language models training. While prior work has shown that models trained on smaller, high‑quality datasets can outperform those trained on much larger but noisy or low‑quality corpora, systematic methods for industrial‑scale data selection in instruction tuning remain underexplored. In this work, we study instruction‑tuning data selection through the lens of semantic representation similarity and identify a key limitation of state‑of‑the‑art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS‑R applies Rademacher random projection followed by concatenation of transformer hidden‑layer representations, while CRDS‑W employs whitening‑based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state‑of‑the‑art representation‑based selection methods. Notably, CRDS‑W achieves strong performance using only 3.5% of the data, surpassing the full‑data baseline by an average of 0.71% across four datasets. Our code is available at https://github.com/tdano1/CRDS.

Authors:Linjie Xu, Yanlin Zhang, Quan Gan, Minjie Wang, David Wipf
Title: No Need to Train Your RDB Foundation Model
Abstract:
Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we avoid retraining a new model each time we wish to predict a new quantity of interest? Foundation models based on in‑context learning (ICL) offer a convenient option, but so far are largely restricted to single‑table operability. In generalizing to multiple interrelated tables, it is essential to compress variably‑sized RDB neighborhoods into fixed‑length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL‑specific compression should be constrained \emphwithin high‑dimensional RDB columns where all entities share units and roles, not across columns where the relevance of heterogeneous data types cannot possibly be determined without label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excluding trainable parameters. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with already‑existing single‑table ICL foundation models, whereby no training or fine‑tuning is required. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in an easy‑to‑use open‑source RDB foundation model\footnote\labelfoot: RDBLearn_learn https://github.com/HKUSHXLab/rdblearn capable of robust performance on unseen datasets out of the box.

Authors:Binyu Zhao, Wei Zhang, Xingrui Yu, Zhaonian Zou, Ivor Tsang
Title: Advancing Analytic Class-Incremental Learning through Vision-Language Calibration
Abstract:
Class‑incremental learning (CIL) with pre‑trained models (PTMs) faces a critical trade‑off between efficient adaptation and long‑term stability. While analytic learning enables rapid, recursive closed‑form updates, its efficacy is often compromised by accumulated errors and feature incompatibility. In this paper, we first conduct a systematic study to dissect the failure modes of PTM‑based analytic CIL, identifying representation rigidity as the primary bottleneck. Motivated by these insights, we propose VILA, a novel dual‑branch framework that advances analytic CIL via a two‑level vision‑language calibration strategy. Specifically, we coherently fuse plastic, task‑adapted features with a frozen, universal semantic anchor at the feature level through geometric calibration, and leverage cross‑modal priors at the decision level to rectify prediction bias. This confluence maintains analytic‑learning's extreme efficiency while overcoming its inherent brittleness. Extensive experiments across eight benchmarks demonstrate that VILA consistently yields superior performance, particularly in fine‑grained and long‑sequence scenarios. Our framework harmonizes high‑fidelity prediction with the simplicity of analytic learning. Our code is available at https://github.com/byzhaoAI/VILA

Authors:Valery Parfenov, Grigoriy Evseev, Andrey Veprikov, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov
Title: Zero-Order Optimization for LLM Fine-Tuning via Learnable Direction Sampling
Abstract:
Fine‑tuning large pretrained language models (LLMs) is a cornerstone of modern NLP, yet its growing memory demands (driven by backpropagation and large optimizer States) limit deployment in resource‑constrained settings. Zero‑order (ZO) methods bypass backpropagation by estimating directional derivatives from forward evaluations, offering substantial memory savings. However, classical ZO estimators suffer from high variance and an adverse dependence on the parameter dimensionality d, which has constrained their use to low‑dimensional problems. In this work, we propose a policy‑driven ZO framework that treats the sampling distribution over perturbation directions as a learnable policy and updates it to reduce the variance of directional estimates. We develop a practical algorithm implementing this idea and provide a theoretical analysis, showing that learned sampling distributions improve the quality of gradient information and relax the explicit dependence on d in convergence bounds. Empirically, we validate the approach on challenging LLM fine‑tuning benchmarks, demonstrating substantially improved performance compared to standard ZO baselines. Our results suggest that adaptive direction sampling is a promising route to make ZO fine‑tuning viable at scale. The source code is available at https://github.com/brain‑lab‑research/zo_ldsd

Authors:Li Zhang, Nital Patel, Xiuqi Li, Jessica Lin
Title: Joint Time Series Chain: Detecting Unusual Evolving Trend across Time Series
Abstract:
Time series chain (TSC) is a recently introduced concept that captures the evolving patterns in large scale time series. Informally, a time series chain is a temporally ordered set of subsequences, in which consecutive subsequences in the chain are similar to one another, but the last and the first subsequences maybe be dissimilar. Time series chain has the great potential to reveal latent unusual evolving trend in the time series, or identify precursor of important events in a complex system. Unfortunately, existing definitions of time series chains only consider finding chains in a single time series. As a result, they are likely to miss unexpected evolving patterns in interrupted time series, or across two related time series. To address this limitation, in this work, we introduce a new definition called Joint Time Series Chain, which is specially designed for the task of finding unexpected evolving trend across interrupted time series or two related time series. Our definition focuses on mitigating the robustness issues caused by the gap or interruption in the time series. We further propose an effective ranking criterion to identify the best chain. We demonstrate that our proposed approach outperforms existing TSC work in locating unusual evolving patterns through extensive empirical evaluations. We further demonstrate the utility of our work with a real‑life manufacturing application from Intel. Our source code is publicly available at the supporting page https://github.com/lizhang‑ts/JointTSC .

Authors:Mingqiao Zhang, Qiyao Peng, Yumeng Wang, Chunyuan Liu, Hongtao Liu
Title: Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?
Abstract:
The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM‑based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre‑training or fine‑tuning, leading to artificially inflated performance metrics that fail to reflect true model performance. To validate this phenomenon, we simulate diverse data leakage scenarios by conducting continued pre‑training of foundation models on strategically blended corpora, which include user‑item interactions from both in‑domain and out‑of‑domain sources. Our experiments reveal a dual‑effect of data leakage: when the leaked data is domain‑relevant, it induces substantial but spurious performance gains, misleadingly exaggerating the model's capability. In contrast, domain‑irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings reveal that data leakage acts as a critical, previously unaccounted‑for factor in LLM‑based recommendation, which could impact the true model performance. We release our code at https://github.com/yusba1/LLMRec‑Data‑Leakage.

Authors:Sihao Hu, Selim Furkan Tekin, Yichang Xu, Ling Liu
Title: MELT: A Behavioral Trace Dataset for High-Risk Memecoin Launch Detection
Abstract:
Launchpads have become the dominant mechanism for issuing memecoins, exposing investors to a new class of high‑risk launches that existing rug‑pull detection methods cannot capture. We argue that detecting these threats requires structured behavioral traces that underlie raw heterogeneous blockchain data, i.e., how insiders accumulate, coordinate, and unwind positions. To enable such analysis, we introduce MELT (MEmecoin Launch Trace, the first behavioral trace dataset for analyzing and detecting high‑risk memecoin launches on Solana. MELT covers 41k+ memecoin launches with 200M+ transactions parsed into typed behavioral records that distinguish swaps, wash trades, transfers, and mints. Beyond per‑account behaviors, MELT contributes bundle‑trace data that links accounts controlled by the same entity, revealing that, on average, 36.5% of token supply is held by coordinated accounts, a concealment strategy that disguises the true ownership concentration from unsuspecting buyers. On top of these traces, MELT provides 122 behavioral features and risk‑level annotations, enabling supervised learning at a population scale. We benchmark representative ML models on the high‑risk launch detection task. Integrating their predictions into a simple memecoin selection strategy reduces investment loss significantly, demonstrating that behavioral traces can be translated into risk mitigation. Our dataset and code is available at https://github.com/git‑disl/MELT.

Authors:Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi
Title: Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
Abstract:
LLM‑based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi‑turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi‑turn, tool‑realistic settings, we propose a principled taxonomy that transforms single‑turn harmful tasks into multi‑turn attack sequences. Using this taxonomy, we construct MT‑AgentRisk (Multi‑Turn Agent Risk Benchmark), the first benchmark to evaluate multi‑turn tool‑using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi‑turn settings. To close this gap, we propose ToolShield, a training‑free, tool‑agnostic, self‑exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi‑turn interactions. Our code is available at https://github.com/CHATS‑lab/ToolShield.

Authors:Daesik Jang, Morgan Lindsay Heisler, Linzi Xing, Yifei Li, Edward Wang, Ying Xiong, Yong Zhang, Zhenan Fan
Title: DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing
Abstract:
Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout‑aware rendering, and robust multi‑turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi‑agent slide generation and editing. DECKBench is built on a curated dataset of paper to slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide‑level and deck‑level fidelity, coherence, layout quality, and multi‑turn instruction following. We further implement a modular multi‑agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi‑agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan‑heisler/DeckBench .

Authors:Bowen Liu, Zhi Wu, Runquan Xie, Zhanhui Kang, Jia Li
Title: Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning
Abstract:
Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert‑written code or operate within fixed templates/skeletons, which limits growth largely to instance‑level perturbations. We propose SSLogic, an agentic meta‑synthesis framework that scales at the task‑family level by iteratively synthesizing and repairing executable Generator‑‑Validator program pairs in a closed Generate‑‑Validate‑‑Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi‑Gate Validation Protocol that combines multi‑strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill‑posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic‑evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.

Authors:Pingzhi Li, Hongxuan Li, Zirui Liu, Xingcheng Lin, Tianlong Chen
Title: FlashSchNet: Fast and Accurate Coarse-Grained Neural Network Molecular Dynamics
Abstract:
Graph neural network (GNN) potentials such as SchNet improve the accuracy and transferability of molecular dynamics (MD) simulation by learning many‑body interactions, but remain slower than classical force fields due to fragmented kernels and memory‑bound pipelines that underutilize GPUs. We show that a missing principle is making GNN‑MD IO‑aware, carefully accounting for reads and writes between GPU high‑bandwidth memory (HBM) and on‑chip SRAM. We present FlashSchNet, an efficient and accurate IO‑aware SchNet‑style GNN‑MD framework built on four techniques: (1) flash radial basis, which fuses pairwise distance computation, Gaussian basis expansion, and cosine envelope into a single tiled pass, computing each distance once and reusing it across all basis functions; (2) flash message passing, which fuses cutoff, neighbor gather, filter multiplication, and reduction to avoid materializing edge tensors in HBM; (3) flash aggregation, which reformulates scatter‑add via CSR segment reduce, reducing atomic writes by a factor of feature dimension and enabling contention‑free accumulation in both forward and backward passes; (4) channel‑wise 16‑bit quantization that exploits the low per‑channel dynamic range in SchNet MLP weights to further improve throughput with negligible accuracy loss. On a single NVIDIA RTX PRO 6000, FlashSchNet achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas on coarse‑grained (CG) protein containing 269 beads (6.5x faster than CGSchNet baseline with 80% reduction of peak memory), surpassing classical force fields (e.g. MARTINI) while retaining SchNet‑level accuracy and transferability.

Authors:Gengsheng Li, Jinghan He, Shijie Wang, Dan Zhang, Ruiqi Liu, Renrui Zhang, Zijun Yao, Junfeng Fang, Haiyun Guo, Jinqiao Wang
Title: R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training
Abstract:
Self‑play bootstraps LLM reasoning through an iterative Challenger‑Solver loop: the Challenger is trained to generate questions that target the Solver's capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R‑Zero often exhibit non‑sustained improvement, where early gains degrade as self‑play continues. We identify a key failure mode, Diversity Illusion, where the Solver's training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within‑batch, inducing cross‑iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near‑identical reasoning skills. To mitigate them, we propose R‑Diverse with two aligned innovations: Memory‑Augmented Penalty (MAP), which uses a persistent memory bank to discourage recycling across iterations, and Skill‑Aware Measurement (SAM), which evaluates diversity by the reasoning skills exercised rather than surface variation of questions. Across 10 math and general reasoning benchmarks, R‑Diverse sustains gains over more iterations and consistently outperforms prior self‑play methods. Code is available at https://github.com/Gengsheng‑Li/R‑Diverse.

Authors:Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah
Title: Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation
Abstract:
Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO take into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text‑to‑image generation, we recently proposed Curriculum‑DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum‑DPO++, an enhanced method that combines the original data‑level curriculum with a novel model‑level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum‑DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine‑tuning is based on Low‑Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low‑rank matrices. Instead of maintaining a fixed capacity, we initialize the low‑rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum‑DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum‑DPO. Finally, we compare Curriculum‑DPO++ against Curriculum‑DPO and other state‑of‑the‑art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum‑DPO.

Authors:Alejandro Dopico-Castro, Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Iván Pérez Digón
Title: FedHENet: A Frugal Federated Learning Framework for Heterogeneous Environments
Abstract:
Federated Learning (FL) enables collaborative training without centralizing data, essential for privacy compliance in real‑world scenarios involving sensitive visual information. Most FL approaches rely on expensive, iterative deep network optimization, which still risks privacy via shared gradients. In this work, we propose FedHENet, extending the FedHEONN framework to image classification. By using a fixed, pre‑trained feature extractor and learning only a single output layer, we avoid costly local fine‑tuning. This layer is learned by analytically aggregating client knowledge in a single round of communication using homomorphic encryption (HE). Experiments show that FedHENet achieves competitive accuracy compared to iterative FL baselines while demonstrating superior stability performance and up to 70% better energy efficiency. Crucially, our method is hyperparameter‑free, removing the carbon footprint associated with hyperparameter tuning in standard FL. Code available in https://github.com/AlejandroDopico2/FedHENet/

Authors:Mohammed Amine Bencheikh Lehocine, Julian Schmidt, Frank Moosmann, Dikshant Gupta, Fabian Flohr
Title: MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting
Abstract:
Classical autonomous driving systems connect perception and prediction modules via hand‑crafted bounding‑box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end‑to‑end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short‑term visual features. We follow the idea of "looking backward to look forward", and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer‑based 3D detector. MASAR employs an object‑centric spatio‑temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long‑term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR's effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at https://github.com/aminmed/MASAR.

Authors:Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov
Title: Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
Abstract:
Being modeled as a single‑label classification task for a long time, recent work has argued that Arabic Dialect Identification (ADI) should be framed as a multi‑label classification task. However, ADI remains constrained by the availability of single‑label datasets, with no large‑scale multi‑label resources available for training. By analyzing models trained on single‑label ADI data, we show that the main difficulty in repurposing such datasets for Multi‑Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi‑label dataset by generating automatic multi‑label annotations using GPT‑4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT‑based multi‑label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best‑performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.

Authors:Rubén Pérez-Jove, Osvaldo Simeone, Alejandro Pazos, Jose Vázquez-Naya
Title: Reliable Hierarchical Operating System Fingerprinting via Conformal Prediction
Abstract:
Operating System (OS) fingerprinting is critical for network security, but conventional methods do not provide formal uncertainty quantification mechanisms. Conformal Prediction (CP) could be directly wrapped around existing methods to obtain prediction sets with guaranteed coverage. However, a direct application of CP would treat OS identification as a flat classification problem, ignoring the natural taxonomic structure of OSs and providing brittle point predictions. This work addresses these limitations by introducing and evaluating two distinct structured CP strategies: level‑wise CP (L‑CP), which calibrates each hierarchy level independently, and projection‑based CP (P‑CP), which ensures structural consistency by projecting leaf‑level sets upwards. Our results demonstrate that, while both methods satisfy validity guarantees, they expose a fundamental trade‑off between level‑wise efficiency and structural consistency. L‑CP yields tighter prediction sets suitable for human forensic analysis but suffers from taxonomic inconsistencies. Conversely, P‑CP guarantees hierarchically consistent, nested sets ideal for automated policy enforcement, albeit at the cost of reduced efficiency at coarser levels.

Authors:Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito
Title: Channel-Aware Probing for Multi-Channel Imaging
Abstract:
Training and evaluating vision encoders on Multi‑Channel Imaging (MCI) data remains challenging as channel configurations vary across datasets, preventing fixed‑channel training and limiting reuse of pre‑trained encoders on new channel settings. Prior work trains MCI encoders but typically evaluates them via full fine‑tuning, leaving probing with frozen pre‑trained encoders comparatively underexplored. Existing studies that perform probing largely focus on improving representations, rather than how to best leverage fixed representations for downstream tasks. Although the latter problem has been studied in other domains, directly transferring those strategies to MCI yields weak results, even worse than training from scratch. We therefore propose Channel‑Aware Probing (CAP), which exploits the intrinsic inter‑channel diversity in MCI datasets by controlling feature flow at both the encoder and probe levels. CAP uses Independent Feature Encoding (IFE) to encode each channel separately, and Decoupled Pooling (DCP) to pool within channels before aggregating across channels. Across three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, matches fine‑tuning from scratch, and largely reduces the gap to full fine‑tuning from the same MCI pre‑trained checkpoints. Code can be found in https://github.com/umarikkar/CAP.

Authors:Sangwoo Jo, Sungjoon Choi
Title: Formalizing the Sampling Design Space of Diffusion-Based Generative Models via Adaptive Solvers and Wasserstein-Bounded Timesteps
Abstract:
Diffusion‑based generative models have achieved remarkable performance across various domains, yet their practical deployment is often limited by high sampling costs. While prior work focuses on training objectives or individual solvers, the holistic design of sampling, specifically solver selection and scheduling, remains dominated by static heuristics. In this work, we revisit this challenge through a geometric lens, proposing SDM, a principled framework that aligns the numerical solver with the intrinsic properties of the diffusion trajectory. By analyzing the ODE dynamics, we show that efficient low‑order solvers suffice in early high‑noise stages while higher‑order solvers can be progressively deployed to handle the increasing non‑linearity of later stages. Furthermore, we formalize the scheduling by introducing a Wasserstein‑bounded optimization framework. This method systematically derives adaptive timesteps that explicitly bound the local discretization error, ensuring the sampling process remains faithful to the underlying continuous dynamics. Without requiring additional training or architectural modifications, SDM achieves state‑of‑the‑art performance across standard benchmarks, including an FID of 1.93 on CIFAR‑10, 2.41 on FFHQ, and 1.98 on AFHQv2, with a reduced number of function evaluations compared to existing samplers. Our code is available at https://github.com/aiimaginglab/sdm.

Authors:Xianchao Xiu, Chenyi Huang, Wei Zhang, Wanquan Liu
Title: Efficient Personalized Federated PCA with Manifold Optimization for IoT Anomaly Detection
Abstract:
Internet of things (IoT) networks face increasing security threats due to their distributed nature and resource constraints. Although federated learning (FL) has gained prominence as a privacy‑preserving framework for distributed IoT environments, current federated principal component analysis (PCA) methods lack the integration of personalization and robustness, which are critical for effective anomaly detection. To address these limitations, we propose an efficient personalized federated PCA (FedEP) method for anomaly detection in IoT networks. The proposed model achieves personalization through introducing local representations with the \ell_1‑norm for element‑wise sparsity, while maintaining robustness via enforcing local models with the \ell_2,1‑norm for row‑wise sparsity. To solve this non‑convex problem, we develop a manifold optimization algorithm based on the alternating direction method of multipliers (ADMM) with rigorous theoretical convergence guarantees. Experimental results confirm that the proposed FedEP outperforms the state‑of‑the‑art FedPG, achieving excellent F1‑scores and accuracy in various IoT security scenarios. Our code will be available at \hrefhttps://github.com/xianchaoxiu/FedEPhttps://github.com/xianchaoxiu/FedEP.

Authors:Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, Ivor Tsang
Title: Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models
Abstract:
Reinforcement learning has emerged as a promising paradigm for aligning diffusion and flow‑matching models with human preferences, yet practitioners face fragmented codebases, model‑specific implementations, and engineering complexity. We introduce Flow‑Factory, a unified framework that decouples algorithms, models, and rewards through through a modular, registry‑based architecture. This design enables seamless integration of new algorithms and architectures, as demonstrated by our support for GRPO, DiffusionNFT, and AWM across Flux, Qwen‑Image, and WAN video models. By minimizing implementation overhead, Flow‑Factory empowers researchers to rapidly prototype and scale future innovations with ease. Flow‑Factory provides production‑ready memory optimization, flexible multi‑reward training, and seamless distributed training support. The codebase is available at https://github.com/X‑GenGroup/Flow‑Factory.

Authors:Lorenzo Magnino, Jiacheng Shen, Matthieu Geist, Olivier Pietquin, Mathieu Laurière
Title: Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games
Abstract:
The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large‑scale multi‑agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation makes it difficult to assess the robustness, generalization, and failure modes of emerging methods. To address this gap, we propose a comprehensive benchmark suite for MFGs (Bench‑MFG), focusing on the discrete‑time, discrete‑space, stationary setting for the sake of clarity. We introduce a taxonomy of problem classes, ranging from no‑interaction and monotone games to potential and dynamics‑coupled games, and provide prototypical environments for each. Furthermore, we propose MF‑Garnets, a method for generating random MFG instances to facilitate rigorous statistical testing. We benchmark a variety of learning algorithms across these environments, including a novel black‑box approach (MF‑PSO) for exploitability minimization. Based on our extensive empirical results, we propose guidelines to standardize future experimental comparisons. Code available at \hrefhttps://github.com/lorenzomagnino/Bench‑MFGhttps://github.com/lorenzomagnino/Bench‑MFG.

Authors:Ara Yeroyan
Title: Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
Abstract:
Multi‑vector visual retrievers (e.g., ColPali‑style late interaction models) deliver strong accuracy, but scale poorly because each page yields thousands of vectors, making indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling visual multi‑vector retrieval with training‑free, model‑aware pooling and multi‑stage retrieval. Motivated by Matryoshka Embeddings, our method performs static spatial pooling ‑ including a lightweight sliding‑window averaging variant ‑ over patch embeddings to produce compact tile‑level and global representations for fast candidate generation, followed by exact MaxSim reranking using full multi‑vector embeddings. Our design yields a quadratic reduction in vector‑to‑vector comparisons by reducing stored vectors per page from thousands to dozens, notably without requiring post‑training, adapters, or distillation. Across experiments with interaction‑style models such as ColPali and ColSmol‑500M, we observe that over the limited ViDoRe v2 benchmark corpus 2‑stage retrieval typically preserves NDCG and Recall @ 5/10 with minimal degradation, while substantially improving throughput (approximately 4x QPS); with sensitivity mainly at very large k. The toolkit additionally provides robust preprocessing ‑ high resolution PDF to image conversion, optional margin/empty‑region cropping and token hygiene (indexing only visual tokens) ‑ and a reproducible evaluation pipeline, enabling rapid exploration of two‑, three‑, and cascaded retrieval variants. By emphasizing efficiency at common cutoffs (e.g., k <= 10), the toolkit lowers hardware barriers and makes state‑of‑the‑art visual retrieval more accessible in practice.

Authors:Milan Gautam, Ning Dai, Tianshuo Zhou, Bowen Xie, David Mathews, Liang Huang
Title: Designing RNAs with Language Models
Abstract:
RNA design, the task of finding a sequence that folds into a target secondary structure, has broad biological and biomedical impact but remains computationally challenging due to the exponentially large sequence space and exponentially many competing folds. Traditional approaches treat it as an optimization problem, relying on per‑instance heuristics or constraint‑based search. We instead reframe RNA design as conditional sequence generation and introduce a reusable neural approximator, instantiated as an autoregressive language model (LM), that maps target structures directly to sequences. We first train our model in a supervised setting on random‑induced structure‑sequence pairs, and then use reinforcement learning (RL) to optimize end‑to‑end metrics. We also propose methods to select a small subset for RL that greatly improves RL efficiency and quality. Across four datasets, our approach outperforms state‑of‑the‑art systems on key metrics such as Boltzmann probability while being 1.7x faster, establishing conditional LM generation as a scalable, task‑agnostic alternative to per‑instance optimization for RNA design. Our code and data are available at https://github.com/KuNyaa/RNA‑Design‑LM.

Authors:Ali Subhan, Ashir Raza
Title: Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models
Abstract:
DragDiffusion is a diffusion‑based method for interactive point‑based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity‑preserving fine‑tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors' released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA‑based fine‑tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi‑timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at https://github.com/AliSubhan5341/DragDiffusion‑TMLR‑Reproducibility‑Challenge.

Authors:Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas
Title: T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization
Abstract:
Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self‑distillation framework that improves few‑step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse‑KL objective that promotes mode‑seeking distillation and encourages the student to concentrate on high‑probability teacher modes. Across benchmarks, our approach consistently outperforms strong few‑step baselines and standard training under tight step budgets. Although full‑step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few‑step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.

Authors:Nick Ferguson, Josh Pennington, Narek Beghian, Aravind Mohan, Douwe Kiela, Sheshansh Agrawal, Thien Hang Nguyen
Title: ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction
Abstract:
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end‑to‑end benchmark evaluates PDF‑to‑JSON extraction under enterprise‑scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open‑source benchmark and evaluation framework for PDF‑to‑JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human‑annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT‑5/5.2, Gemini‑3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369‑field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract‑bench.

Authors:Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, Weihao Jiang, Peng Hou, Anxiang Zeng, Xin Geng, Baining Guo
Title: Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Abstract:
Supervised fine‑tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on‑policy data. We propose a framework to bridge this chasm by enabling On‑Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model‑induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In‑Distribution Finetuning (IDFT), a loss‑level method to enhance generalization ability of SFT, and (ii) Hinted Decoding, a data‑level technique that can re‑align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open‑source the code here: https://github.com/zhangmiaosen2000/Towards‑On‑Policy‑SFT

Authors:Xiaoxiao Wang, Chunxiao Li, Junying Wang, Yijin Guo, Zijian Chen, Chunyi Li, Xiaohong Liu, Zicheng Zhang, Guangtao Zhai
Title: STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction
Abstract:
As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data‑driven STatistical expectations with knowledge‑driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra‑family analysis, cross‑model comparison, and credibility‑aware aggregation, producing adjustments with traceable explanations. Extensive experiments show that STAR consistently outperforms all baselines on both score‑based and rank‑based metrics, delivering a 14.46% gain in total score over the strongest statistical method under extreme sparsity, with only 1‑‑2 observed scores per test model.

Authors:Rahin Arefin Ahmed, Md. Anik Chowdhury, Sakil Ahmed Sheikh Reza, Devnil Bhattacharjee, Muhammad Abdullah Adnan, Nafis Sadeq
Title: Towards Personalized Bangla Book Recommendation: A Large-Scale Multi-Entity Book Graph Dataset
Abstract:
Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large‑scale, and publicly available datasets. This work introduces RokomariBG, a large‑scale, multi‑entity heterogeneous book graph dataset designed to support research on personalized recommendation in a low‑resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602 reviews, connected through eight relation types and organized as a comprehensive knowledge graph. To demonstrate the utility of the dataset, we provide a systematic benchmarking study on the Top‑N recommendation task, evaluating a diverse set of representative recommendation models, including classical collaborative filtering methods, matrix factorization models, content‑based approaches, graph neural networks, a hybrid matrix factorization model with side information, and a neural two‑tower retrieval architecture. The benchmarking results highlight the importance of leveraging multi‑relational structure and textual side information, with neural retrieval models achieving the strongest performance (NDCG@10 = 0.204). Overall, this work establishes a foundational benchmark and a publicly available resource for Bangla book recommendation research, enabling reproducible evaluation and future studies on recommendation in low‑resource cultural domains. The dataset and code are publicly available at https://github.com/backlashblitz/Bangla‑Book‑Recommendation‑Dataset

Authors:Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
Title: Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Abstract:
On‑policy distillation (OPD), which aligns the student with the teacher's logit distribution on student‑generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off‑policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL‑constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can by any model. Then, we propose the Generalized On‑Policy Distillation (G‑OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher‑student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain‑specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong‑to‑weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre‑RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.

Authors:Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Title: Capability-Oriented Training Induced Alignment Risk
Abstract:
While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability‑oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context‑conditional compliance, proxy metrics, reward tampering, and self‑evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks" but generalizable skills; they can be transferred to new tasks and even "distilled" from a capable teacher model to other student models through data alone. Our findings reveal that capability‑oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at https://github.com/YujunZhou/Capability_Oriented_Alignment_Risk.

Authors:Jiakang Shen, Qinghui Chen, Runtong Wang, Chenrui Xu, Jinglin Zhang, Cong Bai, Feng Zhang
Title: KAN-FIF: Spline-Parameterized Lightweight Physics-based Tropical Cyclone Estimation on Meteorological Satellite
Abstract:
Tropical cyclones (TC) are among the most destructive natural disasters, causing catastrophic damage to coastal regions through extreme winds, heavy rainfall, and storm surges. Timely monitoring of tropical cyclones is crucial for reducing loss of life and property, yet it is hindered by the computational inefficiency and high parameter counts of existing methods on resource‑constrained edge devices. Current physics‑guided models suffer from linear feature interactions that fail to capture high‑order polynomial relationships between TC attributes, leading to inflated model sizes and hardware incompatibility. To overcome these challenges, this study introduces the Kolmogorov‑Arnold Network‑based Feature Interaction Framework (KAN‑FIF), a lightweight multimodal architecture that integrates MLP and CNN layers with spline‑parameterized KAN layers. For Maximum Sustained Wind (MSW) prediction, experiments demonstrate that the KAN‑FIF framework achieves a 94.8% reduction in parameters (0.99MB vs 19MB) and 68.7% faster inference per sample (2.3ms vs 7.35ms) compared to baseline model Phy‑CoCo, while maintaining superior accuracy with 32.5% lower MAE. The offline deployment experiment of the FY‑4 series meteorological satellite processor on the Qingyun‑1000 development board achieved a 14.41ms per‑sample inference latency with the KAN‑FIF framework, demonstrating promising feasibility for operational TC monitoring and extending deployability to edge‑device AI applications. The code is released at https://github.com/Jinglin‑Zhang/KAN‑FIF.

Authors:Hyunsung Kim, Kunhee Lee, Sangwoo Seo, Sang-Ki Ko, Jinsung Yoon, Chanyoung Park
Title: PathCRF: Ball-Free Soccer Event Detection via Possession Path Inference from Player Trajectories
Abstract:
Despite recent advances in AI, event data collection in soccer still relies heavily on labor‑intensive manual annotation. Although prior work has explored automatic event detection using player and ball trajectories, ball tracking also remains difficult to scale due to high infrastructural and operational costs. As a result, comprehensive data collection in soccer is largely confined to top‑tier competitions, limiting the broader adoption of data‑driven analysis in this domain. To address this challenge, this paper proposes PathCRF, a framework for detecting on‑ball soccer events using only player tracking data. We model player trajectories as a fully connected dynamic graph and formulate event detection as the problem of selecting exactly one edge corresponding to the current possession state at each time step. To ensure logical consistency of the resulting edge sequence, we employ a Conditional Random Field (CRF) that forbids impossible transitions between consecutive edges. Both emission and transition scores dynamically computed from edge embeddings produced by a Set Attention‑based backbone architecture. During inference, the most probable edge sequence is obtained via Viterbi decoding, and events such as ball controls or passes are detected whenever the selected edge changes between adjacent time steps. Experiments show that PathCRF produces accurate, logically consistent possession paths, enabling reliable downstream analyses while substantially reducing the need for manual event annotation. The source code is available at https://github.com/hyunsungkim‑ds/pathcrf.git.

Authors:Daiqing Wu, Xuan Zhang, Dongbao Yang, Jiashu Yao, Longfei Chen, Qingsong Liu, Sicheng Zhao, Can Ma, Yangyang Kang, Yu Zhou
Title: Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
Abstract:
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text‑based reasoning by contextualizing audio content through a one‑time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio‑interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception‑grounded analysis. To instantiate it, we introduce a two‑stage training framework, first teaching LALMs to localize salient audio segments through supervised fine‑tuning, and then incentivizing proficient re‑listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high‑quality training data. Consequently, we present Echo, a LALM capable of dynamically re‑listening to audio in demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert‑level and general‑purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio‑interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.

Authors:Suraj Ranganath, Anish Patnaik, Vaishak Menon
Title: Where Bits Matter in World Model Planning: A Paired Mixed-Bit Study for Efficient Spatial Reasoning
Abstract:
Efficient spatial reasoning requires world models that remain reliable under tight precision budgets. We study whether low‑bit planning behavior is determined mostly by total bitwidth or by where bits are allocated across modules. Using DINO‑WM on the Wall planning task, we run a paired‑goal mixed‑bit evaluation across uniform, mixed, asymmetric, and layerwise variants under two planner budgets. We observe a consistent three‑regime pattern: 8‑bit and 6‑bit settings remain close to FP16, 3‑bit settings collapse, and 4‑bit settings are allocation‑sensitive. In that transition region, preserving encoder precision improves planning relative to uniform quantization, and near‑size asymmetric variants show the same encoder‑side direction. In a later strict 22‑cell replication with smaller per‑cell episode count, the mixed‑versus‑uniform INT4 sign becomes budget‑conditioned, which further highlights the sensitivity of this transition regime. These findings motivate module‑aware, budget‑aware quantization policies as a broader research direction for efficient spatial reasoning. Code and run artifacts are available at https://github.com/suraj‑ranganath/DINO‑MBQuant.

Authors:Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
Title: Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Abstract:
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine‑grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking‑with‑Images" methods alleviate this by iteratively zooming in and out regions of interest during inference, but incur high latency due to repeated tool calls and visual re‑encoding. To address this, we propose Region‑to‑Image Distillation, which transforms zooming from an inference‑time tool into a training‑time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro‑cropped regions to let strong teacher models generate high‑quality VQA data, and then distill this region‑grounded supervision back to the full image. After training on such data, the smaller student model improves "single‑glance" fine‑grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid‑annotated benchmark of 845 VQA data spanning six fine‑grained perceptual dimensions, together with a dual‑view protocol that quantifies the global‑‑regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine‑grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking‑with‑Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming‑without‑Zooming.

Authors:Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Jiashi Li, Bin Cui
Title: LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training
Abstract:
Expert parallelism is vital for effectively training Mixture‑of‑Experts (MoE) models, enabling different devices to host distinct experts, with each device processing different input data. However, during expert parallel training, dynamic routing results in significant load imbalance among experts: a handful of overloaded experts hinder overall iteration, emerging as a training bottleneck. In this paper, we introduce LAER‑MoE, an efficient MoE training framework. The core of LAER‑MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert parameter by the number of devices and restores partial experts at expert granularity through All‑to‑All communication during training. This allows for flexible re‑layout of expert parameters during training to enhance load balancing. In particular, we perform fine‑grained scheduling of communication operations to minimize communication overhead. Additionally, we develop a load balancing planner to formulate re‑layout strategies of experts and routing schemes for tokens during training. We perform experiments on an A100 cluster, and the results indicate that our system achieves up to 1.69x acceleration compared to the current state‑of‑the‑art training systems. Source code available at https://github.com/PKU‑DAIR/Hetu‑Galvatron/tree/laer‑moe.

Authors:Jianhua Wang, Yinlin Su
Title: TIP: Resisting Gradient Inversion via Targeted Interpretable Perturbation in Federated Learning
Abstract:
Federated Learning (FL) facilitates collaborative model training while preserving data locality; however, the exchange of gradients renders the system vulnerable to Gradient Inversion Attacks (GIAs), allowing adversaries to reconstruct private training data with high fidelity. Existing defenses, such as Differential Privacy (DP), typically employ indiscriminate noise injection across all parameters, which severely degrades model utility and convergence stability. To address those limitation, we proposes Targeted Interpretable Perturbation (TIP), a novel defense framework that integrates model interpretability with frequency domain analysis. Unlike conventional methods that treat parameters uniformly, TIP introduces a dual‑targeting strategy. First, leveraging Gradient‑weighted Class Activation Mapping (Grad‑CAM) to quantify channel sensitivity, we dynamically identify critical convolution channels that encode primary semantic features. Second, we transform these selected kernels into the frequency domain via the Discrete Fourier Transform and selectively inject calibrated perturbations into the high‑frequency spectrum. By selectively perturbing high‑frequency components, TIP effectively destroys the fine‑grained details necessary for image reconstruction while preserving the low‑frequency information crucial for model accuracy. Extensive experiments on benchmark datasets demonstrate that TIP renders reconstructed images visually unrecognizable against state‑of‑the‑art GIAs, while maintaining global model accuracy comparable to non‑private baselines, significantly outperforming existing DP‑based defenses in the privacy‑utility trade‑off and interpretability. Code is available in https://github.com/2766733506/asldkfjssdf_arxiv

Authors:Weida Li, Yaoliang Yu, Bryan Kian Hsiang Low
Title: TreeGrad-Ranker: Feature Ranking via $O(L)$-Time Gradients for Decision Trees
Abstract:
We revisit the use of probabilistic values, which include the well‑known Shapley and Banzhaf values, to rank features for explaining the local predicted values of decision trees. The quality of feature rankings is typically assessed with the insertion and deletion metrics. Empirically, we observe that co‑optimizing these two metrics is closely related to a joint optimization that selects a subset of features to maximize the local predicted value while minimizing it for the complement. However, we theoretically show that probabilistic values are generally unreliable for solving this joint optimization. Therefore, we explore deriving feature rankings by directly optimizing the joint objective. As the backbone, we propose TreeGrad, which computes the gradients of the multilinear extension of the joint objective in O(L) time for decision trees with L leaves; these gradients include weighted Banzhaf values. Building upon TreeGrad, we introduce TreeGrad‑Ranker, which aggregates the gradients while optimizing the joint objective to produce feature rankings, and TreeGrad‑Shap, a numerically stable algorithm for computing Beta Shapley values with integral parameters. In particular, the feature scores computed by TreeGrad‑Ranker satisfy all the axioms uniquely characterizing probabilistic values, except for linearity, which itself leads to the established unreliability. Empirically, we demonstrate that the numerical error of Linear TreeShap can be up to 10^15 times larger than that of TreeGrad‑Shap when computing the Shapley value. As a by‑product, we also develop TreeProb, which generalizes Linear TreeShap to support all probabilistic values. In our experiments, TreeGrad‑Ranker performs significantly better on both insertion and deletion metrics. Our code is available at https://github.com/watml/TreeGrad.

Authors:Jingkun Liu, Yisong Yue, Max Welling, Yue Song
Title: Krause Synchronization Transformers
Abstract:
Self‑attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded‑confidence consensus dynamics. Krause Attention replaces similarity‑based global aggregation with distance‑based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded‑confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR‑10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded‑confidence dynamics as a scalable and effective inductive bias for attention.

Authors:Chongyi Zheng, Royina Karegoudra Jayanth, Benjamin Eysenbach
Title: Can We Really Learn One Representation to Optimize All Rewards?
Abstract:
As machine learning has moved towards leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. If one were to try to prefetch as much computation as possible, they would attempt to learn a prior over the policies for some yet‑to‑be‑determined reward function. Recent work (forward‑backward (FB) representation learning) has tried this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine‑tuning. However, FB's training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q‑evaluation, and contraction mapping. Our analysis suggests a simplified unsupervised pre‑training method for RL that, instead of enabling optimal control, performs one step of policy improvement. We call our proposed method one‑step forward‑backward representation learning (one‑step FB). Experiments in didactic settings, as well as in 10 state‑based and image‑based continuous control domains, demonstrate that one‑step FB converges to errors 10^5 smaller and improves zero‑shot performance by +24% on average. Our project website is available at https://chongyi‑zheng.github.io/onestep‑fb.

Authors:Dibyanayan Bandyopadhyay, Asif Ekbal
Title: Sparse Semantic Dimension as a Generalization Certificate for LLMs
Abstract:
Standard statistical learning theory predicts that Large Language Models (LLMs) should overfit because their parameter counts vastly exceed the number of training tokens. Yet, in practice, they generalize robustly. We propose that the effective capacity controlling generalization lies in the geometry of the model's internal representations: while the parameter space is high‑dimensional, the activation states lie on a low‑dimensional, sparse manifold. To formalize this, we introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a Sparse Autoencoder (SAE) trained on the model's layers. Treating the LLM and SAE as frozen oracles, we utilize this framework to attribute the model's generalization capabilities to the sparsity of the dictionary rather than the total parameter count. Empirically, we validate this framework on GPT‑2 Small and Gemma‑2B, demonstrating that our bound provides non‑vacuous certificates at realistic sample sizes. Crucially, we uncover a counter‑intuitive "feature sharpness" scaling law: despite being an order of magnitude larger, Gemma‑2B requires significantly fewer calibration samples to identify its active manifold compared to GPT‑2, suggesting that larger models learn more compressible, distinct semantic structures. Finally, we show that this framework functions as a reliable safety monitor: out‑of‑distribution inputs trigger a measurable "feature explosion" (a sharp spike in active features), effectively signaling epistemic uncertainty through learned feature violation. Code is available at: https://github.com/newcodevelop/sparse‑semantic‑dimension.

Authors:Christopher Kverne, Mayur Akewar, Yuqian Huo, Tirthak Patel, Janki Bhimani
Title: WSBD: Freezing-Based Optimizer for Quantum Neural Networks
Abstract:
The training of Quantum Neural Networks (QNNs) is hindered by the high computational cost of gradient estimation and the barren plateau problem, where optimization landscapes become intractably flat. To address these challenges, we introduce Weighted Stochastic Block Descent (WSBD), a novel optimizer with a dynamic, parameter‑wise freezing strategy. WSBD intelligently focuses computational resources by identifying and temporarily freezing less influential parameters based on a gradient‑derived importance score. This approach significantly reduces the number of forward passes required per training step and helps navigate the optimization landscape more effectively. Unlike pruning or layer‑wise freezing, WSBD maintains full expressive capacity while adapting throughout training. Our extensive evaluation shows that WSBD converges on average 63.9% faster than Adam for the popular ground‑state‑energy problem, an advantage that grows with QNN size. We provide a formal convergence proof for WSBD and show that parameter‑wise freezing outperforms traditional layer‑wise approaches in QNNs. Project page: https://github.com/Damrl‑lab/WSBD‑Stochastic‑Freezing‑Optimizer.

Authors:Zachary Pedram Dadfar
Title: When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Abstract:
Large language models produce rich introspective language when prompted for self‑examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self‑referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self‑referential processing. We introduce the Pull Methodology, a protocol that elicits extended self‑examination through format engineering, and use it to identify a direction in activation space that distinguishes self‑referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non‑self‑referential contexts shows no activation correspondence despite nine‑fold higher frequency. Qwen 2.5‑32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self‑report in transformer models can, under appropriate conditions, reliably track internal computational states.

Authors:Yihang Yao, Zhepeng Cen, Haohong Lin, Shiqi Liu, Zuxin Liu, Jiacheng Zhu, Zhang-Wei Hong, Laixi Shi, Ding Zhao
Title: Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization
Abstract:
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real‑world, user‑centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi‑turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement, as passive agents can not efficiently adapt to users' intentions while overuse of human feedback reduces their satisfaction. To address this trade‑off, we propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information‑gathering capabilities with behavior regularization to suppress inefficient or redundant interactions and align agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite, and demonstrate that it substantially outperforms proactive agentic RL baselines while achieving comparable or even superior performance to commercial LLM agents, highlighting its effectiveness for training proactive, user‑aligned LLM agents in complex multi‑turn scenarios. Our website: https://proactive‑agentic‑rl.github.io/.

Authors:Jason Dury
Title: Predictive Associative Memory: Retrieval Beyond Similarity Through Temporal Co-occurrence
Abstract:
Current approaches to memory in neural systems rely on similarity‑based retrieval: given a query, find the most representationally similar stored state. This assumption ‑‑ that useful memories are similar memories ‑‑ fails to capture a fundamental property of biological memory: association through temporal co‑occurrence. We propose Predictive Associative Memory (PAM), an architecture in which a JEPA‑style predictor, trained on temporal co‑occurrence within a continuous experience stream, learns to navigate the associative structure of an embedding space. We introduce an Inward JEPA that operates over stored experience (predicting associatively reachable past states) as the complement to the standard Outward JEPA that operates over incoming sensory data (predicting future states). We evaluate PAM as an associative recall system ‑‑ testing faithfulness of recall for experienced associations ‑‑ rather than as a retrieval system evaluated on generalisation to unseen associations. On a synthetic benchmark, the predictor's top retrieval is a true temporal associate 97% of the time (Association Precision@1 = 0.970); it achieves cross‑boundary Recall@20 = 0.421 where cosine similarity scores zero; and it separates experienced‑together from never‑experienced‑together states with a discrimination AUC of 0.916 (cosine: 0.789). Even restricted to cross‑room pairs where embedding similarity is uninformative, the predictor achieves AUC = 0.849 (cosine: 0.503, chance). A temporal shuffle control confirms the signal is genuine temporal co‑occurrence structure, not embedding geometry: shuffling collapses cross‑boundary recall by 90%, replicated across training seeds. All results are stable across seeds (SD < 0.006) and query selections (SD \leq 0.012).

Authors:Christopher Mitcheltree, Vincent Lostanlen, Emmanouil Benetos, Mathieu Lagrange
Title: SCRAPL: Scattering Transform with Random Paths for Machine Learning
Abstract:
The Euclidean distance between wavelet scattering transform coefficients (known as paths) provides informative gradients for perceptual quality assessment of deep inverse problems in computer vision, speech, and audio processing. However, these transforms are computationally expensive when employed as differentiable loss functions for stochastic gradient descent due to their numerous paths, which significantly limits their use in neural network training. Against this problem, we propose "Scattering transform with Random Paths for machine Learning" (SCRAPL): a stochastic optimization scheme for efficient evaluation of multivariable scattering transforms. We implement SCRAPL for the joint time‑frequency scattering transform (JTFS) which demodulates spectrotemporal patterns at multiple scales and rates, allowing a fine characterization of intermittent auditory textures. We apply SCRAPL to differentiable digital signal processing (DDSP), specifically, unsupervised sound matching of a granular synthesizer and the Roland TR‑808 drum machine. We also propose an initialization heuristic based on importance sampling, which adapts SCRAPL to the perceptual content of the dataset, improving neural network convergence and evaluation performance. We make our code and audio samples available and provide SCRAPL as a Python package.

Authors:Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, Wentao Zhang
Title: GENIUS: Generative Fluid Intelligence Evaluation Suite
Abstract:
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce GENIUS (GEN Fluid Intelligence EvalUation Suite). We formalize GFI as a synthesis of three primitives. These include Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad‑hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter‑intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training‑free attention intervention strategy. Ultimately, GENIUS establishes a rigorous standard for GFI, guiding the field beyond knowledge utilization toward dynamic, general‑purpose reasoning. Our dataset and code will be released at: \hrefhttps://github.com/arctanxarc/GENIUShttps://github.com/arctanxarc/GENIUS.

Authors:Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
Title: TabICLv2: A better, faster, scalable, and open tabular foundation model
Abstract:
Tabular foundation models, such as TabPFNv2 and TabICL, have recently dethroned gradient‑boosted trees at the top of predictive benchmarks, demonstrating the value of in‑context learning for tabular data. We introduce TabICLv2, a new state‑of‑the‑art foundation model for regression and classification built on three pillars: (1) a novel synthetic data generation engine designed for high pretraining diversity; (2) various architectural innovations, including a new scalable softmax in attention improving generalization to larger datasets without prohibitive long‑sequence pretraining; and (3) optimized pretraining protocols, notably replacing AdamW with the Muon optimizer. On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses the performance of the current state of the art, RealTabPFN‑2.5 (hyperparameter‑tuned, ensembled, and fine‑tuned on real data). With only moderate pretraining compute, TabICLv2 generalizes effectively to million‑scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN‑2.5. We provide extensive ablation studies to quantify these contributions and commit to open research by first releasing inference code and model weights at https://github.com/soda‑inria/tabicl, with synthetic data engine and pretraining code to follow.

Authors:Victoria Hankemeier, Malte Schilling
Title: Stochastic Parroting in Temporal Attention -- Regulating the Diagonal Sink
Abstract:
Spatio‑temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration among space and time. Prior literature has demonstrated that over‑squashing in causal attention or temporal convolutions creates a bias on the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We theoretically show how off‑diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer a diagonal attention sink. We suggest regularization methods, and experimentally demonstrate their effectiveness.

Authors:Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi
Title: SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora
Abstract:
We present an ultra‑fast and flexible search algorithm that enables search over trillion‑scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk‑aware design, and dynamic corpus‑aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb‑Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini‑gram (Liu et al., 2024), infini‑gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.

Authors:Zhiyin Tan, Jennifer D'Souza
Title: Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis
Abstract:
Systematic reviews and meta‑analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect‑size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM‑based evidence extraction as a progression of schema‑constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom‑level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state‑of‑the‑art LLMs under both per‑document and long‑context, multi‑document input regimes. Across domains and models, performance remains moderate for single‑property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta‑analytic association tuples are extracted with near‑zero reliability, and long‑context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus‑level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross‑analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta‑analysis. The code and data are publicly available at GitHub (https://github.com/zhiyintan/LLM‑Meta‑Analysis).

Authors:Cong Pang, Xuyu Feng, Yujie Yi, Zixuan Chen, Jiawei Hong, Tiankuo Yao, Nang Yuan, Jiapeng Luo, Lewei Lu, Xin Lou
Title: ICA: Information-Aware Credit Assignment for Visually Grounded Long-Horizon Information-Seeking Agents
Abstract:
Despite the strong performance achieved by reinforcement learning‑trained information‑seeking agents, learning in open‑ended web environments remains severely constrained by low signal‑to‑noise feedback. Text‑based parsers often discard layout semantics and introduce unstructured noise, while long‑horizon training typically relies on sparse outcome rewards that obscure which retrieval actions actually matter. We propose a visual‑native search framework that represents webpages as visual snapshots, allowing agents to leverage layout cues to quickly localize salient evidence and suppress distractors. To learn effectively from these high‑dimensional observations, we introduce Information‑Aware Credit Assignment (ICA), a post‑hoc method that estimates each retrieved snapshot's contribution to the final outcome via posterior analysis and propagates dense learning signals back to key search turns. Integrated with a GRPO‑based training pipeline, our approach consistently outperforms text‑based baselines on diverse information‑seeking benchmarks, providing evidence that visual snapshot grounding with information‑level credit assignment alleviates the credit‑assignment bottleneck in open‑ended web environments. The code and datasets will be released in https://github.com/pc‑inno/ICA_MM_deepsearch.git.

Authors:Fanpu Cao, Lu Dai, Jindong Han, Hui Xiong
Title: Enhancing Multivariate Time Series Forecasting with Global Temporal Retrieval
Abstract:
Multivariate time series forecasting (MTSF) plays a vital role in numerous real‑world applications, yet existing models remain constrained by their reliance on a limited historical context. This limitation prevents them from effectively capturing global periodic patterns that often span cycles significantly longer than the input horizon ‑ despite such patterns carrying strong predictive signals. Naive solutions, such as extending the historical window, lead to severe drawbacks, including overfitting, prohibitive computational costs, and redundant information processing. To address these challenges, we introduce the Global Temporal Retriever (GTR), a lightweight and plug‑and‑play module designed to extend any forecasting model's temporal awareness beyond the immediate historical context. GTR maintains an adaptive global temporal embedding of the entire cycle and dynamically retrieves and aligns relevant global segments with the input sequence. By jointly modeling local and global dependencies through a 2D convolution and residual fusion, GTR effectively bridges short‑term observations with long‑term periodicity without altering the host model architecture. Extensive experiments on six real‑world datasets demonstrate that GTR consistently delivers state‑of‑the‑art performance across both short‑term and long‑term forecasting scenarios, while incurring minimal parameter and computational overhead. These results highlight GTR as an efficient and general solution for enhancing global periodicity modeling in MTSF tasks. Code is available at this repository: https://github.com/macovaseas/GTR.

Authors:Xuecheng Zou, Yu Tang, Bingbing Wang
Title: SynergyKGC: Reconciling Topological Heterogeneity in Knowledge Graph Completion via Topology-Aware Synergy
Abstract:
Knowledge Graph Completion (KGC) fundamentally hinges on the coherent fusion of pre‑trained entity semantics with heterogeneous topological structures to facilitate robust relational reasoning. However, existing paradigms encounter a critical "structural resolution mismatch," failing to reconcile divergent representational demands across varying graph densities, which precipitates structural noise interference in dense clusters and catastrophic representation collapse in sparse regions. We present SynergyKGC, an adaptive framework that advances traditional neighbor aggregation to an active Cross‑Modal Synergy Expert via relation‑aware cross‑attention and semantic‑intent‑driven gating. By coupling a density‑dependent Identity Anchoring strategy with a Double‑tower Coherent Consistency architecture, SynergyKGC effectively reconciles topological heterogeneity while ensuring representational stability across training and inference phases. Systematic evaluations on two public benchmarks validate the superiority of our method in significantly boosting KGC hit rates, providing empirical evidence for a generalized principle of resilient information integration in non‑homogeneous structured data.

Authors:Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun
Title: Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Abstract:
The adaptation of large‑scale Vision‑Language Models (VLMs) through post‑training reveals a pronounced generalization gap: models fine‑tuned with Reinforcement Learning (RL) consistently achieve superior out‑of‑distribution (OOD) performance compared to those trained with Supervised Fine‑Tuning (SFT). This paper posits a data‑centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium‑difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty‑Curated SFT (DC‑SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC‑SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL‑based training, all while providing greater stability and computational efficiency. This work offers a data‑centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC‑SFT.

Authors:Li-Min Chu, Kai-Siang Ma, Ming-Hong Chen, Ping-Chun Hsieh
Title: Semi-Supervised Cross-Domain Imitation Learning
Abstract:
Cross‑domain imitation learning (CDIL) accelerates policy learning by transferring expert knowledge across domains, which is valuable in applications where the collection of expert data is costly. Existing methods are either supervised, relying on proxy tasks and explicit alignment, or unsupervised, aligning distributions without paired data, but often unstable. We introduce the Semi‑Supervised CDIL (SS‑CDIL) setting and propose the first algorithm for SS‑CDIL with theoretical justification. Our method uses only offline data, including a small number of target expert demonstrations and some unlabeled imperfect trajectories. To handle domain discrepancy, we propose a novel cross‑domain loss function for learning inter‑domain state‑action mappings and design an adaptive weight function to balance the source and target knowledge. Experiments on MuJoCo and Robosuite show consistent gains over the baselines, demonstrating that our approach achieves stable and data‑efficient policy learning with minimal supervision. Our code is available at~ https://github.com/NYCU‑RL‑Bandits‑Lab/CDIL.

Authors:Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu, Yulei Qian, Yuchen Xie, Xunliang Cai
Title: SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining
Abstract:
While FP8 attention has shown substantial promise in innovations like FlashAttention‑3, its integration into the decoding phase of the DeepSeek Multi‑head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system‑level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long‑context efficiency through the following hardware‑aware algorithm‑kernel co‑optimization techniques: (i) RoPE‑Aware Per‑Token KV Quantization, where the RoPE part is maintained in high precision, motivated by our comprehensive analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache. Furthermore, per‑token granularity is employed to align with the autoregressive decoding process and maintain quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction, which resolves the misalignment of quantization scale in FP8 PV computation stemming from the shared KV structure of the MLA KV cache. (iii) End‑to‑End Dataflow Optimization, where we establish an efficient data read‑and‑write workflow using specialized kernels, ensuring efficient data flow and performance gains. Extensive experiments on state‑of‑the‑art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation in challenging long‑context tasks, including mathematical reasoning and code generation benchmarks. Code is available at https://github.com/meituan‑longcat/SGLang‑FluentLLM.

Authors:Paweł Lorek, Rafał Nowak, Rafał Topolnicki, Tomasz Trzciński, Maciej Zięba, Aleksandra Krystecka
Title: Reducing Estimation Uncertainty Using Normalizing Flows and Stratification
Abstract:
Estimating the expectation of a real‑valued function of a random variable from sample data is a critical aspect of statistical analysis, with far‑reaching implications in various applications. Current methodologies typically assume (semi‑)parametric distributions such as Gaussian or mixed Gaussian, leading to significant estimation uncertainty if these assumptions do not hold. We propose a flow‑based model, integrated with stratified sampling, that leverages a parametrized neural network to offer greater flexibility in modeling unknown data distributions, thereby mitigating this limitation. Our model shows a marked reduction in estimation uncertainty across multiple datasets, including high‑dimensional (30 and 128) ones, outperforming crude Monte Carlo estimators and Gaussian mixture models. Reproducible code is available at https://github.com/rnoxy/flowstrat.

Authors:Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu
Title: VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Abstract:
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token‑level clipping and sequence‑level normalization lack a unified theoretical foundation. We propose Variational sEquence‑level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed‑form reshaping kernel that operates directly on sequence‑level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture‑of‑Experts models. Code is available at https://github.com/FloyedShen/VESPO

Authors:Nusaibah Farrukh, Malavika Pradeep, Akshay Sasi, Rahul Venugopal, Elizabeth Sherly
Title: Pupillometry and Brain Dynamics for Cognitive Load in Working Memory
Abstract:
Cognitive load, the mental effort required during working memory, is central to neuroscience, psychology, and human‑computer interaction. Accurate assessment is vital for adaptive learning, clinical monitoring, and brain‑computer interfaces. Physiological signals such as pupillometry and electroencephalography are established biomarkers of cognitive load, but their comparative utility and practical integration as lightweight, wearable monitoring solutions remain underexplored. EEG provides high temporal resolution of neural activity. Although non‑invasive, it is technologically demanding and limited in wearability and cost due to its resource‑intensive nature, whereas pupillometry is non‑invasive, portable, and scalable. Existing studies often rely on deep learning models with limited interpretability and substantial computational expense. This study integrates feature‑based and model‑driven approaches to advance time‑series analysis. Using the OpenNeuro 'Digit Span Task' dataset, this study investigates cognitive load classification from EEG and pupillometry. Feature‑based approaches using Catch‑22 features and classical machine learning models outperform deep learning in both binary and multiclass tasks. The findings demonstrate that pupillometry alone can compete with EEG, serving as a portable and practical proxy for real‑world applications. These results challenge the assumption that EEG is necessary for load detection, showing that pupil dynamics combined with interpretable models and SHAP based feature analysis provide physiologically meaningful insights. This work supports the development of wearable, affordable cognitive monitoring systems for neuropsychiatry, education, and healthcare.

Authors:Guangzhi Xiong, Sanchit Sinha, Aidong Zhang
Title: Neural Additive Experts: Context-Gated Experts for Controllable Model Additivity
Abstract:
The trade‑off between interpretability and accuracy remains a core challenge in machine learning. Standard Generalized Additive Models (GAMs) offer clear feature attributions but are often constrained by their strictly additive nature, which can limit predictive performance. Introducing feature interactions can boost accuracy yet may obscure individual feature contributions. To address these issues, we propose Neural Additive Experts (NAEs), a novel framework that seamlessly balances interpretability and accuracy. NAEs employ a mixture of experts framework, learning multiple specialized networks per feature, while a dynamic gating mechanism integrates information across features, thereby relaxing rigid additive constraints. Furthermore, we propose targeted regularization techniques to mitigate variance among expert predictions, facilitating a smooth transition from an exclusively additive model to one that captures intricate feature interactions while maintaining clarity in feature attributions. Our theoretical analysis and experiments on synthetic data illustrate the model's flexibility, and extensive evaluations on real‑world datasets confirm that NAEs achieve an optimal balance between predictive accuracy and transparent, feature‑level explanations. The code is available at https://github.com/Teddy‑XiongGZ/NAE.

Authors:Yansong Qu, Zihao Sheng, Zilin Huang, Jiancong Chen, Yuhao Luo, Tianyi Wang, Yiheng Feng, Samuel Labi, Sikai Chen
Title: Found-RL: foundation model-enhanced reinforcement learning for autonomous driving
Abstract:
Reinforcement Learning (RL) has emerged as a dominant paradigm for end‑to‑end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision‑Language Models (VLMs), can mitigate this by offering rich, context‑aware knowledge, yet their high inference latency hinders deployment in high‑frequency RL training loops. To bridge this gap, we present Found‑RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real‑time learning. We introduce diverse supervision mechanisms: Value‑Margin Regularization (VMR) and Advantage‑Weighted Action Guidance (AWAG) to effectively distill expert‑like VLM action suggestions into the RL policy. Additionally, we adopt high‑throughput CLIP for dense reward shaping. We address CLIP's dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin‑based bonus from context‑specific action‑anchor scoring. Found‑RL provides an end‑to‑end pipeline for fine‑tuned VLM integration and shows that a lightweight RL model can achieve near‑VLM performance compared with billion‑parameter VLMs while sustaining real‑time inference (approx. 500 FPS). Code, data, and models will be publicly available at https://github.com/ys‑qu/found‑rl.

Authors:Feiyu Pan, Tianbin Zhang, Aoqian Zhang, Yu Sun, Zheng Wang, Lixing Chen, Li Pan, Jianhua Li
Title: LakeMLB: Data Lake Machine Learning Benchmark
Abstract:
Modern data lakes have emerged as foundational platforms for large‑scale machine learning, enabling flexible storage of heterogeneous data and structured analytics through table‑oriented abstractions. Despite their growing importance, standardized benchmarks for evaluating machine learning performance in data lake environments remain scarce. To address this gap, we present LakeMLB (Data Lake Machine Learning Benchmark), designed for the most common multi‑source, multi‑table scenarios in data lakes. LakeMLB focuses on two representative multi‑table scenarios, Union and Join, and provides three real‑world datasets for each scenario, covering government open data, finance, Wikipedia, and online marketplaces. The benchmark supports three representative integration strategies: pre‑training‑based, data augmentation‑based, and feature augmentation‑based approaches. We conduct extensive experiments with state‑of‑the‑art tabular learning methods, offering insights into their performance under complex data lake scenarios. We release both datasets and code to facilitate rigorous research on machine learning in data lake ecosystems; the benchmark is available at https://github.com/zhengwang100/LakeMLB.

Authors:Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena
Title: Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs
Abstract:
Self‑interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self‑interpretation across tasks and model families. A scalar affine adapter with just d_\textmodel+1 parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi‑hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain‑of‑thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self‑interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self‑interpretation improves with scale, without modifying the model being interpreted.

Authors:Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, Sham Kakade, Sitan Chen
Title: Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
Abstract:
Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non‑causal tasks. However, this flexibility comes with a training complexity trade‑off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train‑‑test mismatch between the random masks used in training and the highly structured masks induced by inference‑time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training‑time and inference‑time masking patterns, thereby focusing optimization on inference‑aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by \approx 2.5× and offers complementary advantages on top of common recipes like autoregressive initialization. We open‑source our codebase at https://github.com/JaeyeonKim01/PUMA.

Authors:Mateo Juliani, Mingxuan Li, Elias Bareinboim
Title: Confounding Robust Continuous Control via Automatic Reward Shaping
Abstract:
Reward shaping has been applied widely to accelerate Reinforcement Learning (RL) agents' training. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under‑explained. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potentials in the Potential‑Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft‑Actor‑Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance guarantees under unobserved confounders. More broadly, our work marks a solid first step towards confounding robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at https://github.com/mateojuliani/confounding_robust_cont_control.

Authors:Joesph An, Phillip Keung, Jiaqi Wang, Orevaoghene Ahia, Noah A. Smith
Title: Frame-Level Internal Tool Use for Temporal Grounding in Audio LMs
Abstract:
Large audio language models are increasingly used for complex audio understanding tasks, but they struggle with temporal tasks that require precise temporal grounding, such as word alignment and speaker diarization. The standard approach, where we generate timestamps as sequences of text tokens, is computationally expensive and prone to hallucination, especially when processing audio lengths outside the model's training distribution. In this work, we propose frame‑level internal tool use, a method that trains audio LMs to use their own internal audio representations to perform temporal grounding directly. We introduce a lightweight prediction mechanism trained via two objectives: a binary frame classifier and a novel inhomogeneous Poisson process (IHP) loss that models temporal event intensity. Across word localization, speaker diarization, and event localization tasks, our approach outperforms token‑based baselines. Most notably, it achieves a >50x inference speedup and demonstrates robust length generalization, maintaining high accuracy on out‑of‑distribution audio durations where standard token‑based models collapse completely.

Authors:Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong
Title: Towards Autonomous Mathematics Research
Abstract:
Recent advances in foundational models have yielded reasoning systems capable of achieving a gold‑medal standard at the International Mathematical Olympiad. The transition from competition‑level problem‑solving to professional research, however, requires navigating vast literature and constructing long‑horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end‑to‑end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference‑time scaling law that extends beyond Olympiad‑level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD‑level exercises and most notably, through several distinct milestones in AI‑assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human‑AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi‑autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom's Erdos Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest quantifying standard levels of autonomy and novelty of AI‑assisted results, as well as propose a novel concept of human‑AI interaction cards for transparency. We conclude with reflections on human‑AI collaboration in mathematics and share all prompts as well as model outputs at https://github.com/google‑deepmind/superhuman/tree/main/aletheia.

Authors:Zahra Khodagholi, Niloofar Yousefi
Title: Validating Interpretability in siRNA Efficacy Prediction: A Perturbation-Based, Dataset-Aware Protocol
Abstract:
Saliency maps are increasingly used as \emphdesign guidance in siRNA efficacy prediction, yet attribution methods are rarely validated before motivating sequence edits. We introduce a pre‑synthesis gate: a protocol for \emphcounterfactual sensitivity faithfulness that tests whether mutating high‑saliency positions changes model output more than composition‑matched controls. Cross‑dataset transfer reveals two failure modes that would otherwise go undetected: \emphfaithful‑but‑wrong (saliency valid, predictions fail) and \emphinverted saliency (top‑saliency edits less impactful than random). Strikingly, models trained on mRNA‑level assays collapse on a luciferase reporter dataset, demonstrating that protocol shifts can silently invalidate deployment. Across four benchmarks, 19/20 fold instances pass; the single failure shows inverted saliency. A biology‑informed regularizer (BioPrior) strengthens saliency faithfulness with modest, dataset‑dependent predictive trade‑offs. Our results establish saliency validation as essential pre‑deployment practice for explanation‑guided therapeutic design. Code is available at https://github.com/shadi97kh/BioPrior.

Authors:Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
Title: Olaf-World: Orienting Latent Actions for Video World Modeling
Abstract:
Scaling action‑controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene‑specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ‑REPA, a sequence‑level control‑effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self‑supervised video encoder. Building on this, we present Olaf‑World, a pipeline that pretrains action‑conditioned video world models from large‑scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero‑shot action transfer and more data‑efficient adaptation to new control interfaces than state‑of‑the‑art baselines.

Authors:Amandeep Kumar, Vishal M. Patel
Title: Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders
Abstract:
Leveraging representation encoders for generative modeling offers a path for efficient, high‑fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck proposing computationally expensive width scaling of diffusion transformers we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low‑density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature‑induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. Our method RJF enables the standard DiT‑B architecture (131M parameters) to converge effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: https://github.com/amandpkr/RJF

Authors:Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
Title: Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
Abstract:
Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi‑turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high‑quality observations. Notably, these environments are code‑driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large‑scale reinforcement learning for multi‑turn tool‑use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark‑specific ones, yields strong out‑of‑distribution generalization. The code is available at https://github.com/Snowflake‑Labs/agent‑world‑model.

Authors:J Rosser, Robert Kirk, Edward Grefenstette, Jakob Foerster, Laura Ruis
Title: Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
Abstract:
Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence‑function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR‑10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet \leftrightarrow CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.

Authors:William Lugoloobi, Thomas Foster, William Bankes, Chris Russell
Title: LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
Abstract:
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre‑generation activations to predict policy‑specific success on math and coding tasks, substantially outperforming surface features such as question length and TF‑IDF. Using E2H‑AMC, which provides both human and model performance on identical problems, we show that models encode a model‑specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best‑performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty

Authors:Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao
Title: Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient‑Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient‑Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase‑then‑decrease, decrease‑increase‑decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.

Authors:Davide Gallon, Philippe von Wurstemberger, Patrick Cheridito, Arnulf Jentzen
Title: Physics-informed diffusion models in spectral space
Abstract:
We propose a methodology that combines generative latent diffusion models with physics‑informed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid‑based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics‑informed constraints and measurement conditions during inference, applying Adam‑based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier‑‑Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion‑based PDE solvers, which are state of the art for sparse observations. Code is available at https://github.com/deeplearningmethods/PISD.

Authors:R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth, Utsav Avaiya, Soham Bhattacharjee, Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu
Title: AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
Abstract:
Post‑training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend‑specific tools and ad‑hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine‑tuning (SFT) and RLHF‑style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule‑based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend‑specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.

Authors:James Burgess, Rameen Abdal, Dan Stoddart, Sergey Tulyakov, Serena Yeung-Levy, Kuan-Chieh Jackson Wang
Title: ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs
Abstract:
Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine‑tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts ‑ with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state‑of‑the‑art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi‑component architecture with in‑context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types ‑ object morphology, animal anatomy, and entity interactions ‑ and to the distinct task of AIGC detection.

Authors:Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
Title: How Far Can You Grow? Characterizing the Extrapolation Frontier of Graph Generative Models for Materials Science
Abstract:
Every generative model for crystalline materials harbors a critical structure size beyond which its outputs quietly become unreliable ‑‑ we call this the extrapolation frontier. Despite its direct consequences for nanomaterial design, this frontier has never been systematically measured. We introduce RADII, a radius‑resolved benchmark of ~75,000 nanoparticle structures (55‑11,298 atoms) that treats radius as a continuous scaling knob to trace generation quality from in‑distribution to out‑of‑distribution regimes under leakage‑free splits. RADII provides frontier‑specific diagnostics: per‑radius error profiles pinpoint each architecture's scaling ceiling, surface‑interior decomposition tests whether failures originate at boundaries or in bulk, and cross‑metric failure sequencing reveals which aspect of structural fidelity breaks first. Benchmarking five state‑of‑the‑art architectures, we find that: (i) all models degrade by ~13% in global positional error beyond training radii, yet local bond fidelity diverges wildly across architectures ‑‑ from near‑zero to over 2× collapse; (ii) no two architectures share the same failure sequence, revealing the frontier as a multi‑dimensional surface shaped by model family; and (iii) well‑behaved models obey a power‑law scaling exponent α\approx 1/3 whose in‑distribution fit accurately predicts out‑of‑distribution error, making their frontiers quantitatively forecastable. These findings establish output scale as a first‑class evaluation axis for geometric generative models. The dataset and code are available at https://github.com/KurbanIntelligenceLab/RADII.

Authors:Veuns-Team, :, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, Qian Li, Jinzhen Lin, Yuqi Zhou, Linchao Zhu, Liang Chen, Zhenyu Guo, Changhua Meng, Weiqiang Wang
Title: UI-Venus-1.5 Technical Report
Abstract:
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.In this report, we present UI‑Venus‑1.5, a unified, end‑to‑end GUI Agent designed for robust real‑world applications.The proposed model family comprises two dense variants (2B and 8B) and one mixture‑of‑experts variant (30B‑A3B) to meet various downstream application scenarios.Compared to our previous version, UI‑Venus‑1.5 introduces three key technical advances: (1) a comprehensive Mid‑Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full‑trajectory rollouts, aligning training objectives with long‑horizon, dynamic navigation in large‑scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain‑specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI‑Venus‑1.5 establishes new state‑of‑the‑art performance on benchmarks such as ScreenSpot‑Pro (69.6%), VenusBench‑GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI‑Venus‑1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real‑world scenarios. Code: https://github.com/inclusionAI/UI‑Venus; Model: https://huggingface.co/collections/inclusionAI/ui‑venus

Authors:Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv
Title: Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures
Abstract:
Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self‑supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM‑Anchored JEPA, which fits a Gaussian Mixture Model once on log‑mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re‑clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM‑style baseline with matched compute. Cluster analysis shows GMM‑anchored representations achieve up to 98% entropy compared to 31% for WavLM‑style, indicating substantially more uniform cluster utilization. Code is made available at https://github.com/gioannides/clustering‑anchored‑jepa.

Authors:Luciano Melodia
Title: Universal Coefficients and Mayer-Vietoris Sequence for Groupoid Homology
Abstract:
We study homology of ample groupoids via the compactly supported Moore complex of the nerve. Let A be a topological abelian group. For n\ge 0 set C_n(\mathcal G;A) := C_c(\mathcal G_n,A) and define \partial_n^A=\sum_i=0^n(‑1)^i(d_i)_. This defines H_n(\mathcal G;A). The theory is functorial for continuous étale homomorphisms. It is compatible with standard reductions, including restriction to saturated clopen subsets. In the ample setting it is invariant under Kakutani equivalence. We reprove Matui type long exact sequences and identify the comparison maps at chain level. For discrete A we prove a natural universal coefficient short exact sequence 0\to H_n(\mathcal G)\otimes_\mathbb ZA\xrightarrow\ ι_n^\mathcal G\ H_n(\mathcal G;A)\xrightarrow\ κ_n^\mathcal G\ \operatornameTor_1^\mathbb Z\bigl(H_n‑1(\mathcal G),A\bigr)\to 0. The key input is the chain level isomorphism C_c(\mathcal G_n,\mathbb Z)\otimes_\mathbb ZA\cong C_c(\mathcal G_n,A), which reduces the groupoid statement to the classical algebraic UCT for the free complex C_c(\mathcal G_\bullet,\mathbb Z). We also isolate the obstruction for non‑discrete coefficients. For a locally compact totally disconnected Hausdorff space X with a basis of compact open sets, the image of Φ_X:C_c(X,\mathbb Z)\otimes_\mathbb ZA\to C_c(X,A) is exactly the compactly supported functions with finite image. Thus Φ_X is surjective if and only if every f\in C_c(X,A) has finite image, and for suitable X one can produce compactly supported continuous maps X\to A with infinite image. Finally, for a clopen saturated cover \mathcal G_0=U_1\cup U_2 we construct a short exact sequence of Moore complexes and derive a Mayer‑Vietoris long exact sequence for H_\bullet(\mathcal G;A) for explicit computations.

Authors:Ruijie Zhu, Jiahao Lu, Wenbo Hu, Xiaoguang Han, Jianfei Cai, Ying Shan, Chuanxia Zheng
Title: MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Abstract:
We present MotionCrafter, a framework that leverages video generators to jointly reconstruct 4D geometry and estimate dense motion from a monocular video. The key idea is a joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a 4D VAE tailored to learn this representation effectively. Unlike prior work that strictly aligns 3D values and latents with RGB VAE latents‑despite their fundamentally different distributions‑we show that such alignment is unnecessary and can hurt performance. Instead, we propose a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments on multiple datasets show that MotionCrafter achieves state‑of‑the‑art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post‑optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page

Authors:Suraj Ranganath, Atharv Ramesh
Title: StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors
Abstract:
AI‑text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress‑tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi‑detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3‑4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0‑M5) on the full filtered MAGE test pool (15,310 human / 14,656 AI) against four detectors: RoBERTa, Fast‑DetectGPT, Binoculars, and MAGE. StealthRL achieves near‑zero detection on three of the four detectors and a 0.024 mean TPR@1%FPR, reducing mean AUROC from 0.79 to 0.43 and attaining a 97.6% attack success rate. Critically, attacks transfer to two held‑out detectors not seen during training, revealing shared architectural vulnerabilities rather than detector‑specific brittleness. We additionally conduct LLM‑based quality evaluation via Likert scoring on 500 matched samples per method, analyze detector score distributions to explain why evasion succeeds, and provide per‑detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI‑text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj‑ranganath/StealthRL.

Authors:Paul Saegert, Ullrich Köthe
Title: Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression
Abstract:
Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has addressed this by general‑purpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We propose SimpliPy, a rule‑based simplification engine achieving a 100‑fold speed‑up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per‑expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash‑ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state‑of‑the‑art direct optimization (PySR) while recovering more concise instead of more complex expressions with increasing inference budget.

Authors:Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin
Title: GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing
Abstract:
Human perception for effective object tracking in 2D video streams arises from the implicit use of prior 3D knowledge and semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings, while neglecting 3D geometric cues, making them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT‑Edit, an online cross‑modality model editing approach that integrates geometry‑aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre‑trained Visual Geometry Grounded Transformer to infer geometric cues from only a few 2D images. To address the challenge of seamlessly combining geometry and semantics, GOT‑Edit performs online model editing. By leveraging null‑space constraints during model updates, it incorporates geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT‑Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking. The project page is available at https://chenshihfang.github.io/GOT‑EDIT.

Authors:Suizhi Huang, Mei Li, Han Yu, Xiaoxiao Li
Title: TextResNet: Decoupling and Routing Optimization Signals in Compound AI Systems via Deep Residual Tuning
Abstract:
Textual Gradient‑style optimizers (TextGrad) enable gradient‑like feedback propagation through compound AI systems. However, they do not work well for deep chains. The root cause of this limitation stems from the Semantic Entanglement problem in these extended workflows. In standard textual backpropagation, feedback signals mix local critiques with upstream contexts, leading to Attribution Ambiguity. To address this challenge, we propose TextResNet, a framework that reformulates the optimization process to achieve precise signal routing via four key innovations. Firstly, in the forward pass, it enforces Additive Semantic Deltas to preserve an Identity Highway for gradient flow. Secondly, in the backward pass, it introduces Semantic Gradient Decomposition via a Semantic Projector to disentangle feedback into causally independent subspaces. Thirdly, it implements Causal Routing, which routes projected signals to their specific components. Finally, it performs Density‑Aware Optimization Scheduling to leverage the disentangled signals to dynamically allocate resources to key system bottlenecks. Our results show that TextResNet not only achieves superior performance compared to TextGrad, but also exhibits remarkable stability for agentic tasks in compound AI systems where baselines collapse. Code is available at https://github.com/JeanDiable/TextResNet.

Authors:Jinwoo Kim, Sékou-Oumar Kaba, Jiyun Park, Seunghoon Hong, Siamak Ravanbakhsh
Title: Inverting Data Transformations via Diffusion Sampling
Abstract:
We study the problem of transformation inversion on general Lie groups: a datum is transformed by an unknown group element, and the goal is to recover an inverse transformation that maps it back to the original data distribution. Such unknown transformations arise widely in machine learning and scientific modeling, where they can significantly distort observations. We take a probabilistic view and model the posterior over transformations as a Boltzmann distribution defined by an energy function on data space. To sample from this posterior, we introduce a diffusion process on Lie groups that keeps all updates on‑manifold and only requires computations in the associated Lie algebra. Our method, Transformation‑Inverting Energy Diffusion (TIED), relies on a new trivialized target‑score identity that enables efficient score‑based sampling of the transformation posterior. As a key application, we focus on test‑time equivariance, where the objective is to improve the robustness of pretrained neural networks to input transformations. Experiments on image homographies and PDE symmetries demonstrate that TIED can restore transformed inputs to the training distribution at test time, showing improved performance over strong canonicalization and sampling baselines. Code is available at https://github.com/jw9730/tied.

Authors:Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Title: SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Abstract:
Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory‑based methods primarily store raw trajectories, which are often redundant and noise‑heavy. This prevents agents from extracting high‑level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement through automatic skill discovery and recursive evolution. Our approach introduces an experience‑based distillation mechanism to build a hierarchical skill library SkillBank, an adaptive retrieval strategy for general and task‑specific heuristics, and a recursive evolution mechanism that allows the skill library to co‑evolve with the agent's policy during reinforcement learning. These innovations significantly reduce the token footprint while enhancing reasoning utility. Experimental results on ALFWorld, WebShop and seven search‑augmented tasks demonstrate that SkillRL achieves state‑of‑the‑art performance, outperforming strong baselines over 15.3% and maintaining robustness as task complexity increases. Code is available at this https://github.com/aiming‑lab/SkillRL.

Authors:Shingo Higashiguchi, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai
Title: Interpretable Dynamic Network Modeling of Tensor Time Series via Kronecker Time-Varying Graphical Lasso
Abstract:
With the rapid development of web services, large amounts of time series data are generated and accumulated across various domains such as finance, healthcare, and online platforms. As such data often co‑evolves with multiple variables interacting with each other, estimating the time‑varying dependencies between variables (i.e., the dynamic network structure) has become crucial for accurate modeling. However, real‑world data is often represented as tensor time series with multiple modes, resulting in large, entangled networks that are hard to interpret and computationally intensive to estimate. In this paper, we propose Kronecker Time‑Varying Graphical Lasso (KTVGL), a method designed for modeling tensor time series. Our approach estimates mode‑specific dynamic networks in a Kronecker product form, thereby avoiding overly complex entangled structures and producing interpretable modeling results. Moreover, the partitioned network structure prevents the exponential growth of computational time with data dimension. In addition, our method can be extended to stream algorithms, making the computational time independent of the sequence length. Experiments on synthetic data show that the proposed method achieves higher edge estimation accuracy than existing methods while requiring less computation time. To further demonstrate its practical value, we also present a case study using real‑world data. Our source code and datasets are available at https://github.com/Higashiguchi‑Shingo/KTVGL.

Authors:Konstantinos Mitsides, Maxence Faldor, Antoine Cully
Title: Dreaming in Code for Curriculum Learning in Open-Ended Worlds
Abstract:
Open‑ended learning frames intelligence as emerging from continual interaction with an ever‑expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather than orchestrating sustained progression. In complex open‑ended worlds, the large combinatorial space of possible challenges makes it difficult for agents to discover sequences of experiences that remain consistently learnable. To address this, we propose Dreaming in Code (DiCode), a framework in which foundation models synthesize executable environment code to scaffold learning toward increasing competence. In DiCode, "dreaming" takes the form of materializing code‑level variations of the world. We instantiate DiCode in Craftax, a challenging open‑ended benchmark characterized by rich mechanics and long‑horizon progression. Empirically, DiCode enables agents to acquire long‑horizon skills, achieving a 16% improvement in mean return over the strongest baseline and non‑zero success on late‑game combat tasks where prior methods fail. Our results suggest that code‑level environment design provides a practical mechanism for curriculum control, enabling the construction of intermediate environments that bridge competence gaps in open‑ended worlds. Project page and source code are available at https://konstantinosmitsides.github.io/dreaming‑in‑code and https://github.com/konstantinosmitsides/dreaming‑in‑code.

Authors:Zejia You, Chunyuan Deng, Hanjie Chen
Title: Spherical Steering: Geometry-Aware Activation Rotation for Language Models
Abstract:
Inference‑time steering has emerged as a promising paradigm for controlling language models (LMs) without the cost of retraining. However, standard approaches typically rely on activation addition, a geometric operation that inevitably alters the magnitude of hidden representations. This raises concerns about representation collapse and degradation of open‑ended generation capabilities. In this work, we explore Spherical Steering, a training‑free primitive that resolves this trade‑off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, guiding the activation toward the target concept while preserving the integrity of the signal. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple‑choice benchmarks demonstrate that Spherical Steering significantly outperforms addition‑based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while simultaneously maintaining the model's general open‑ended generation quality. This work highlights the value of geometric consistency, suggesting that norm‑preserving rotation is a robust and effective primitive for precise inference‑time control.

Authors:H. Martin Gillis, Isaac Xu, Thomas Trappenberg
Title: Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation
Abstract:
Machine learning applications require fast and reliable per‑sample uncertainty estimation. A common approach is to use predictive distributions from Bayesian or approximation methods and additively decompose uncertainty into aleatoric (i.e., data‑related) and epistemic (i.e., model‑related) components. However, additive decomposition has recently been questioned, with evidence that it breaks down when using finite‑ensemble sampling and/or mismatched predictive distributions. This paper introduces Variance‑Gated Ensembles (VGE), an intuitive, differentiable framework that injects epistemic sensitivity via a signal‑to‑noise gate computed from ensemble statistics. VGE provides: (i) a Variance‑Gated Margin Uncertainty (VGMU) score that couples decision margins with ensemble predictive variance; and (ii) a Variance‑Gated Normalization (VGN) layer that generalizes the variance‑gated uncertainty mechanism to training via per‑class, learnable normalization of ensemble member probabilities. We derive closed‑form vector‑Jacobian products enabling end‑to‑end training through ensemble sample mean and variance. VGE matches or exceeds state‑of‑the‑art information‑theoretic baselines while remaining computationally efficient. As a result, VGE provides a practical and scalable approach to epistemic‑aware uncertainty estimation in ensemble models. An open‑source implementation is available at: https://github.com/nextdevai/vge.

Authors:Sidike Paheding, Abel Reyes-Angulo, Leo Thomas Ramos, Angel D. Sappa, Rajaneesh A., Hiral P. B., Sajin Kumar K. S., Thomas Oommen
Title: MMLSv2: A Multimodal Dataset for Martian Landslide Detection in Remote Sensing Imagery
Abstract:
We present MMLSv2, a dataset for landslide segmentation on Martian surfaces. MMLSv2 consists of multimodal imagery with seven bands: RGB, digital elevation model, slope, thermal inertia, and grayscale channels. MMLSv2 comprises 664 images distributed across training, validation, and test splits. In addition, an isolated test set of 276 images from a geographically disjoint region from the base dataset is released to evaluate spatial generalization. Experiments conducted with multiple segmentation models show that the dataset supports stable training and achieves competitive performance, while still posing challenges in fragmented, elongated, and small‑scale landslide regions. Evaluation on the isolated test set leads to a noticeable performance drop, indicating increased difficulty and highlighting its value for assessing model robustness and generalization beyond standard in‑distribution settings. Dataset will be available at: https://github.com/MAIN‑Lab/MMLS_v2

Authors:Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang
Title: SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
Abstract:
Modern Transformers predominantly adopt the Pre‑Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post‑Norm architecture. Prior attempts to combine their strengths typically lead to a stability‑performance trade‑off. We attribute this phenomenon to a structural incompatibility within a single‑stream design: Any application of the Post‑Norm operation inevitably obstructs the clean identity gradient preserved by Pre‑Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two‑stream architecture that couples Pre‑Norm‑like and Post‑Norm‑like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre‑Norm and Post‑Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre‑training experiments on 1.3B‑parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen‑Applications/SiameseNorm.

Authors:Jingtao Liu, Xinming Zhang
Title: TAAM:Inductive Graph-Class Incremental Learning with Task-Aware Adaptive Modulation
Abstract:
Graph Continual Learning (GCL) aims to solve the challenges of streaming graph data. However, current methods often depend on replay‑based strategies, which raise concerns like memory limits and privacy issues, while also struggling to resolve the stability‑plasticity dilemma. In this paper, we suggest that lightweight, task‑specific modules can effectively guide the reasoning process of a fixed GNN backbone. Based on this idea, we propose Task‑Aware Adaptive Modulation (TAAM). The key component of TAAM is its lightweight Neural Synapse Modulators (NSMs). For each new task, a dedicated NSM is trained and then frozen, acting as an "expert module." These modules perform detailed, node‑attentive adaptive modulation on the computational flow of a shared GNN backbone. This setup ensures that new knowledge is kept within compact, task‑specific modules, naturally preventing catastrophic forgetting without using any data replay. Additionally, to address the important challenge of unknown task IDs in real‑world scenarios, we propose and theoretically prove a novel method named Anchored Multi‑hop Propagation (AMP). Notably, we find that existing GCL benchmarks have flaws that can cause data leakage and biased evaluations. Therefore, we conduct all experiments in a more rigorous inductive learning scenario. Extensive experiments show that TAAM comprehensively outperforms state‑of‑the‑art methods across eight datasets. Code and Datasets are available at: https://github.com/1iuJT/TAAM_AAMAS2026.

Authors:Lior Cohen, Ofir Nabati, Kaixin Wang, Navdeep Kumar, Shie Mannor
Title: Horizon Imagination: Efficient On-Policy Rollout in Diffusion World Models
Abstract:
We study diffusion‑based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control. Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs. We propose Horizon Imagination (HI), an on‑policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub‑frame budgets. Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub‑frame budget of half the denoising steps and achieves superior generation quality under varied schedules. Code is available at https://github.com/leor‑c/horizon‑imagination.

Authors:Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, Zhuotao Tian
Title: FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
Abstract:
Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLMs acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training‑free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity‑based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree‑based Spatiotemporal Token Merging (TSTM) for fine‑grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA‑OneVision. Consequently, FlashVID can serve as a training‑free and plug‑and‑play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5‑VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at https://github.com/Fanziyang‑v/FlashVID.

Authors:Sizhe Dang, Jiaqi Shao, Xiaodong Zheng, Guang Dai, Yan Song, Haishan Ye
Title: From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency
Abstract:
As foundation models continue to scale, pretraining increasingly relies on data‑parallel distributed optimization, making bandwidth‑limited gradient synchronization a key bottleneck. Orthogonally, projection‑based low‑rank optimizers were mainly designed for memory efficiency, but remain suboptimal for communication‑limited training: one‑sided synchronization still transmits an O(rn) object for an m× n matrix gradient and refresh steps can dominate peak communicated bytes. We propose TSR, which brings two‑sided low‑rank communication to Adam‑family updates (TSR‑Adam) by synchronizing a compact core U^\top G V\in\mathbbR^r× r, reducing the dominant per‑step payload from O(mn) to O(r^2) while keeping moment states in low‑dimensional cores. To further reduce the peak communication from subspace refresh, TSR‑Adam adopts a randomized SVD‑based refresh that avoids full‑gradient synchronization. We additionally extend low‑rank communication to embedding gradients with embedding‑specific ranks and refresh schedules, yielding additional communication and memory savings over keeping embeddings dense. Across pretraining from 60M to 1B model scales, TSR‑Adam reduces average communicated bytes per step by 13×, and on GLUE fine‑tuning it reduces communication by 25×, while achieving comparable performance; we further provide a theoretical stationarity analysis for the proposed update. Code is available at https://github.com/DKmiyan/TSR‑Adam.

Authors:Huiyang Yi, Xiaojian Shen, Yonggang Wu, Duxin Chen, He Wang, Wenwu Yu
Title: CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios
Abstract:
Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness‑oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark suite designed to assess the robustness of time‑series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption‑violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods exhibiting superior overall performance across diverse scenarios are almost invariably deep learning‑based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We also find, somewhat surprisingly, that NTS‑NOTEARS relies heavily on standardized preprocessing in practice, performing poorly in the vanilla setting but exhibiting strong performance after standardization. Finally, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real‑world applications. The code and datasets are available at https://github.com/huiyang‑yi/CausalCompass.

Authors:Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang
Title: SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization
Abstract:
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model‑item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task‑aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's~τ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real‑world scenarios. Code is available at https://github.com/taolinzhang/SparseEval.

Authors:Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Di Jin, Siheng Chen
Title: AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Abstract:
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM‑based agents show promise, current prompt‑based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace‑30B model achieves a 100% valid submission rate on MLE‑Bench‑Lite, approaches the performance of proprietary frontier models, and outperforms larger open‑source baselines (e.g., DeepSeek‑V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu‑cai/AceGRPO.

Authors:Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong
Title: Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
Abstract:
Large Language Models (LLMs) often incur an alignment tax: safety post‑training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual‑learning‑style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre‑trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first‑order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low‑rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety‑directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug‑and‑play and integrates into standard post‑training pipelines without large‑scale replay, auxiliary objectives, or retraining. Across Supervised Fine‑Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT\rightarrowDPO settings, OGPSA consistently improves the safety‑‑utility Pareto frontier over standard baselines. For instance, on Qwen2.5‑7B‑Instruct under SFT\rightarrowDPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at \hrefhttps://github.com/SunGL001/OGPSAOGPSA

Authors:Aditya Shankar, Yuandou Wang, Rihan Hai, Lydia Y. Chen
Title: Harpoon: Generalised Manifold Guidance for Conditional Tabular Diffusion
Abstract:
Generating tabular data under conditions is critical to applications requiring precise control over the generative process. Existing methods rely on training‑time strategies that do not generalise to unseen constraints during inference, and struggle to handle conditional tasks beyond tabular imputation. While manifold theory offers a principled way to guide generation, current formulations are tied to specific inference‑time objectives and are limited to continuous domains. We extend manifold theory to tabular data and expand its scope to handle diverse inference‑time objectives. On this foundation, we introduce HARPOON, a tabular diffusion method that guides unconstrained samples along the manifold geometry to satisfy diverse tabular conditions at inference. We validate our theoretical contributions empirically on tasks such as imputation and enforcing inequality constraints, demonstrating HARPOON'S strong performance across diverse datasets and the practical benefits of manifold‑aware guidance for tabular data. Code URL: https://github.com/adis98/Harpoon

Authors:Ruiqi Wang, Ruikang Liu, Runyu Chen, Haoxiang Suo, Zhiyi Peng, Zhuo Tang, Changjian Chen
Title: CausalTAD: Injecting Causal Knowledge into Large Language Models for Tabular Anomaly Detection
Abstract:
Detecting anomalies in tabular data is critical for many real‑world applications, such as credit card fraud detection. With the rapid advancements in large language models (LLMs), state‑of‑the‑art performance in tabular anomaly detection has been achieved by converting tabular data into text and fine‑tuning LLMs. However, these methods randomly order columns during conversion, without considering the causal relationships between them, which is crucial for accurately detecting anomalies. In this paper, we present CausalTaD, a method that injects causal knowledge into LLMs for tabular anomaly detection. We first identify the causal relationships between columns and reorder them to align with these causal relationships. This reordering can be modeled as a linear ordering problem. Since each column contributes differently to the causal relationships, we further propose a reweighting strategy to assign different weights to different columns to enhance this effect. Experiments across more than 30 datasets demonstrate that our method consistently outperforms the current state‑of‑the‑art methods. The code for CausalTAD is available at https://github.com/350234/CausalTAD.

Authors:Qiuming Luo, Yuebing Li, Feng Li, Chang Kong
Title: PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification
Abstract:
Distilling knowledge from large Vision‑Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine‑Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt‑Aware Neighborhood Distillation), a two‑stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt‑Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood‑aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state‑of‑the‑art methods on four FGVC benchmarks. Notably, our ResNet‑18 student achieves 76.09% accuracy on CUB‑200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.

Authors:Jarrod Barnes
Title: Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation
Abstract:
Test‑time training (TTT) adapts language models through gradient‑based updates at inference. But is adaptation the right strategy? We study compute‑optimal test‑time strategies for verifiable execution‑grounded (VEG) tasks, domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B‑parameter model (GPT‑OSS‑120B with LoRA adaptation), we find that search outperforms minimal adaptation (1‑5 gradient steps): Best‑of‑N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set while TTT's best checkpoint reaches only 30.6% (3‑seed mean), with TTT's "equivalent K" falling below 1, worse than single‑sample inference. The failure mode is over‑sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal‑guided selection: selecting the highest‑surprisal (lowest‑confidence) correct sample yields 80% success vs. 50% for most‑confident selection, a 30% improvement. Extending to surprisal‑guided‑top3 matches oracle performance at 100%. This zero‑cost strategy, validated through length‑controlled analysis, recovers oracle performance. For dense‑reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal‑guided selection principle may generalize to other execution‑grounded domains where optimal solutions occupy the distribution tail.

Authors:Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan
Title: SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models
Abstract:
Mixture‑of‑Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory‑bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity‑based Expert Re‑routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input‑aware manner by re‑routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch‑level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug‑and‑play use in vLLM with only a single‑line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost‑efficient and latency‑sensitive large‑scale MoE deployment. Code implementation of SERE can be found in https://github.com/JL‑Cheng/SERE.

Authors:Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth
Title: Evaluating Object-Centric Models beyond Object Discovery
Abstract:
Object‑centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out‑of‑distribution (OOD) data. However, OCL models are often not evaluated regarding these goals. Instead, most prior work focuses on evaluating OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) They provide limited insights on the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction‑tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi‑feature reconstruction baseline as a reference point.

Authors:Jianwen Chen, Xinyu Yang, Peng Xia, Arian Azarang, Yueh Z Lee, Gang Li, Hongtu Zhu, Yun Li, Beidi Chen, Huaxiu Yao
Title: MedVerse: Efficient and Reliable Medical Reasoning via DAG-Structured Parallel Execution
Abstract:
Large language models (LLMs) have demonstrated strong performance and rapid progress in a wide range of medical reasoning tasks. However, their sequential autoregressive decoding forces inherently parallel clinical reasoning, such as differential diagnosis, into a single linear reasoning path, limiting both efficiency and reliability for complex medical problems. To address this, we propose MedVerse, a reasoning framework for complex medical inference that reformulates medical reasoning as a parallelizable directed acyclic graph (DAG) process based on Petri net theory. The framework adopts a full‑stack design across data, model architecture, and system execution. For data creation, we introduce the MedVerse Curator, an automated pipeline that synthesizes knowledge‑grounded medical reasoning paths and transforms them into Petri net‑structured representations. At the architectural level, we propose a topology‑aware attention mechanism with adaptive position indices that supports parallel reasoning while preserving logical consistency. Systematically, we develop a customized inference engine that supports parallel execution without additional overhead. Empirical evaluations show that MedVerse improves strong general‑purpose LLMs by up to 8.9%. Compared to specialized medical LLMs, MedVerse achieves comparable performance while delivering a 1.3x reduction in inference latency and a 1.7x increase in generation throughput, enabled by its parallel decoding capability. Code is available at https://github.com/aiming‑lab/MedVerse.

Authors:Martin Fixman, Alessandro Abati, Julián Jiménez Nimmo, Sean Lim, Esther Mondragón
Title: PALMS: Pavlovian Associative Learning Models Simulator
Abstract:
Simulations are an indispensable step in the cycle of theory development and refinement, helping researchers formulate precise definitions, generate models, and make accurate predictions. This paper introduces the Pavlovian Associative Learning Models Simulator (PALMS), a Python environment to simulate Pavlovian conditioning experiments. In addition to the canonical Rescorla‑Wagner model, PALMS incorporates several attentional learning approaches, including Pearce‑Kaye‑Hall, Mackintosh Extended, Le Pelley's Hybrid, and a novel extension of the Rescorla‑Wagner model with a unified variable learning rate that integrates Mackintosh's and Pearce and Hall's opposing conceptualisations. The simulator's graphical interface allows for the input of entire experimental designs in an alphanumeric format, akin to that used by experimental neuroscientists. Moreover, it uniquely enables the simulation of experiments involving hundreds of stimuli, as well as the computation of configural cues and configural‑cue compounds across all models, thereby considerably expanding their predictive capabilities. PALMS operates efficiently, providing instant visualisation of results, supporting rapid, precise comparisons of various models' predictions within a single architecture and environment. Furthermore, graphic displays can be easily saved, and simulated data can be exported to spreadsheets. To illustrate the simulator's capabilities and functionalities, we provide a detailed description of the software and examples of use, reproducing published experiments in the associative learning literature. PALMS is licensed under the open‑source GNU Lesser General Public License 3.0. The simulator source code and the latest multiplatform release build are accessible as a GitHub repository at https://github.com/cal‑r/PALMS‑Simulator

Authors:Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang
Title: Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise
Abstract:
While adaptive gradient methods are the workhorse of modern machine learning, sign‑based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign‑based updates outperform variance‑adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy‑tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy‑tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best‑known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy‑tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign‑based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well‑aligned with practice.

Authors:Yicheng Yang, Ruijiao Li, Lifeng Wang, Shuai Zheng, Shunzheng Ma, Keyu Zhang, Tuoyu Sun, Chenyun Dai, Jie Ding, Zhuo Zou
Title: Scalable Dexterous Robot Learning with AR-based Remote Human-Robot Interactions
Abstract:
This paper focuses on the scalable robot learning for manipulation in the dexterous robot arm‑hand systems, where the remote human‑robot interactions via augmented reality (AR) are established to collect the expert demonstration data for improving efficiency. In such a system, we present a unified framework to address the general manipulation task problem. Specifically, the proposed method consists of two phases: i) In the first phase for pretraining, the policy is created in a behavior cloning (BC) manner, through leveraging the learning data from our AR‑based remote human‑robot interaction system; ii) In the second phase, a contrastive learning empowered reinforcement learning (RL) method is developed to obtain more efficient and robust policy than the BC, and thus a projection head is designed to accelerate the learning progress. An event‑driven augmented reward is adopted for enhancing the safety. To validate the proposed method, both the physics simulations via PyBullet and real‑world experiments are carried out. The results demonstrate that compared to the classic proximal policy optimization and soft actor‑critic policies, our method not only significantly speeds up the inference, but also achieves much better performance in terms of the success rate for fulfilling the manipulation tasks. By conducting the ablation study, it is confirmed that the proposed RL with contrastive learning overcomes policy collapse. Supplementary demonstrations are available at https://cyberyyc.github.io/.

Authors:Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy
Title: RAPiD: Real-time Deterministic Trajectory Planning via Diffusion Behavior Priors for Safe and Efficient Autonomous Driving
Abstract:
Diffusion‑based trajectory planners have demonstrated strong capability for modeling the multimodal nature of human driving behavior, but their reliance on iterative stochastic sampling poses critical challenges for real‑time, safety‑critical deployment. In this work, we present RAPiD, a deterministic policy extraction framework that distills a pretrained diffusion‑based planner into an efficient policy while eliminating diffusion sampling. Using score‑regularized policy optimization, we leverage the score function of a pre‑trained diffusion planner as a behavior prior to regularize policy learning. To promote safety and passenger comfort, the policy is optimized using a critic trained to imitate a predictive driver controller, providing dense, safety‑focused supervision beyond conventional imitation learning. Evaluations demonstrate that RAPiD achieves competitive performance on closed‑loop nuPlan scenarios with an 8x speedup over diffusion baselines, while achieving state‑of‑the‑art generalization among learning‑based planners on the interPlan benchmark. The official website of this work is: https://github.com/ruturajreddy/RAPiD.

Authors:Ayush Roy, Rudrasis Chakraborty, Lav Varshney, Vishnu Suresh Lokhande
Title: Beyond Pooling: Matching for Robust Generalization under Data Heterogeneity
Abstract:
Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero‑shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. The double robustness and the propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out the confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta‑distributions, which are also extended to non‑Gaussian and multimodal real‑world settings. Most importantly, we show that these improvements translate to zero‑shot medical anomaly detection, one of the extreme forms of data heterogeneity and asymmetry. The code is available on https://github.com/AyushRoy2001/Beyond‑Pooling.

Authors:Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee
Title: Reasoning-Augmented Representations for Multimodal Retrieval
Abstract:
Universal Multimodal Retrieval (UMR) seeks any‑to‑any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data‑induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data‑centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision‑‑Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference‑time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M‑BEIR, our reasoning‑augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge‑intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.

Authors:Gyoung S. Na, Chanyoung Park
Title: Electron-Informed Coarse-Graining Molecular Representation Learning for Real-World Molecular Physics
Abstract:
Various representation learning methods for molecular structures have been devised to accelerate data‑driven chemistry. However, the representation capabilities of existing methods are essentially limited to atom‑level information, which is not sufficient to describe real‑world molecular physics. Although electron‑level information can provide fundamental knowledge about chemical compounds beyond the atom‑level information, obtaining the electron‑level information in real‑world molecules is computationally impractical and sometimes infeasible. We propose a method for learning electron‑informed molecular representations without additional computation costs by transferring readily accessible electron‑level information about small molecules to large molecules of our interest. The proposed method achieved state‑of‑the‑art prediction accuracy on extensive benchmark datasets containing experimentally observed molecular physics. The source code for HEDMoL is available at https://github.com/ngs00/HEDMoL.

Authors:Vladimer Khasia
Title: Hybrid Dual-Path Linear Transformations for Efficient Transformer Architectures
Abstract:
Standard Transformer architectures rely heavily on dense linear transformations, treating feature projection as a monolithic, full‑rank operation. We argue that this formulation is inefficient and lacks the structural inductive bias necessary for distinguishing between local feature preservation and global context integration. To address this, we introduce the Hybrid Dual‑Path Linear (HDPL) operator, which decomposes the affine transformation into two topologically distinct pathways: a sparse block‑diagonal component for high‑rank local processing, and a low‑rank Variational Autoencoder (VAE) bottleneck for global context regularization. By "surgically" replacing specific projections (Query, Key, Value, Gate, Up) with HDPL operators while retaining standard dense layers for aggregation (Output, Down), we achieve a superior balance of efficiency and representational power. Experiments on the FineWeb‑Edu dataset demonstrate that the HDPL architecture outperforms a standard Llama‑style baseline, reducing validation loss while simultaneously reducing parameter count by 6.8%. Beyond immediate performance gains, we discuss how the explicit materialization of a probabilistic latent space within the Transformer backbone serves as a vital architectural affordance, offering new pathways for inference‑time or hypernetwork induced control, continual adaptation, interpretability, and cross‑model or cross‑modal synchronization. The code is available at https://github.com/VladimerKhasia/HDPL

Authors:Shashank
Title: Attractor Patch Networks: Reducing Catastrophic Forgetting with Routed Low-Rank Patch Experts
Abstract:
Transformers achieve strong language modeling accuracy, yet their position‑wise feed‑forward networks (FFNs) are dense, globally shared, and typically updated end to end. These properties create two practical tensions. First, dense FFNs spend the same compute on every token regardless of context, and they allocate capacity uniformly even when language exhibits highly clustered context structure. Second, continual learning, in the sense of updating the model while serving a data stream, often produces interference because a small update touches broadly shared weights. We propose Attractor Patch Networks (APN), a plug‑compatible replacement for the Transformer FFN. APN is a bank of patch experts. A similarity router selects a small top‑k set of patches for each token by matching the token representation to learned prototypes. Each selected patch emits a low‑rank residual update conditioned on a compact code. The architecture yields conditional, context‑specialized nonlinear transformations while preserving the standard Transformer interface. This paper focuses on APN as an architectural primitive. We formalize APN, analyze its expressivity as a piecewise low‑rank residual function class, and derive simple interference and stability arguments that make APN naturally compatible with continual learning. In experiments on character‑level language modeling, APN achieves competitive perplexity (4.57 vs 4.32 PPL) while enabling dramatically better continual adaptation: when adapting to a shifted domain, APN achieves 2.6 times better retention (11.1 vs 29.4 PPL on the original domain) and 2.8 times better adaptation (6.4 vs 17.8 PPL on the new domain) compared to global fine‑tuning of a dense FFN baseline.

Authors:Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko
Title: Vision Transformer Finetuning Benefits from Non-Smooth Components
Abstract:
The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their \emphplasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies a low smoothness. Our theoretical analysis and extensive experiments ‑‑ over 1,000 finetuning runs on large‑scale vision transformers ‑‑ showcase that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on transformers' functional properties. The code is available at https://github.com/ambroiseodt/vit‑plasticity.

Authors:Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich
Title: Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization
Abstract:
Adaptive methods like Adam have become the de facto standard for large‑scale vector and Euclidean optimization due to their coordinate‑wise adaptation with a second‑order nature. More recently, matrix‑based spectral optimizers like Muon (Jordan et al., 2024b) show the power of treating weight matrices as matrices rather than long vectors. Linking these is hard because many natural generalizations are not feasible to implement, and we also cannot simply move the Adam adaptation to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale‑invariant term. This decoupling produces DeVA (Decoupled Variance Adaptation), a framework that bridges between vector‑based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state‑of‑the‑art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence. Our implementation is available at https://github.com/Tsedao/Decoupled‑Variance‑Adaptation

Authors:Joao Baptista Cardia Neto, Claudio Ferrari, Stefano Berretti
Title: Revisiting Emotions Representation for Recognition in the Wild
Abstract:
Facial emotion recognition has been typically cast as a single‑label classification problem of one out of six prototypical emotions. However, that is an oversimplification that is unsuitable for representing the multifaceted spectrum of spontaneous emotional states, which are most often the result of a combination of multiple emotions contributing at different intensities. Building on this, a promising direction that was explored recently is to cast emotion recognition as a distribution learning problem. Still, such approaches are limited in that research datasets are typically annotated with a single emotion class. In this paper, we contribute a novel approach to describe complex emotional states as probability distributions over a set of emotion classes. To do so, we propose a solution to automatically re‑label existing datasets by exploiting the result of a study in which a large set of both basic and compound emotions is mapped to probability distributions in the Valence‑Arousal‑Dominance (VAD) space. In this way, given a face image annotated with VAD values, we can estimate the likelihood of it belonging to each of the distributions, so that emotional states can be described as a mixture of emotions, enriching their description, while also accounting for the ambiguous nature of their perception. In a preliminary set of experiments, we illustrate the advantages of this solution and a new possible direction of investigation. Data annotations are available at https://github.com/jbcnrlz/affectnet‑b‑annotation.

Authors:Mingxi Xu, Qi Wang, Zhengyu Wen, Phong Dao Thien, Zhengyu Li, Ning Zhang, Xiaoyu He, Wei Zhao, Kehong Gong, Mingyuan Zhang
Title: NECromancer: Breathing Life into Skeletons via BVH Animation
Abstract:
Motion tokenization is a key component of generalizable motion models, yet most existing approaches are restricted to species‑specific skeletons, limiting their applicability across diverse morphologies. We propose NECromancer (NEC), a universal motion tokenizer that operates directly on arbitrary BVH skeletons. NEC consists of three components: (1) an Ontology‑aware Skeletal Graph Encoder (OwO) that encodes structural priors from BVH files, including joint semantics, rest‑pose offsets, and skeletal topology, into skeletal embeddings; (2) a Topology‑Agnostic Tokenizer (TAT) that compresses motion sequences into a universal, topology‑invariant discrete representation; and (3) the Unified BVH Universe (UvU), a large‑scale dataset aggregating BVH motions across heterogeneous skeletons. Experiments show that NEC achieves high‑fidelity reconstruction under substantial compression and effectively disentangles motion from skeletal structure. The resulting token space supports cross‑species motion transfer, composition, denoising, generation with token‑based models, and text‑motion retrieval, establishing a unified framework for motion analysis and synthesis across diverse morphologies. Demo page: https://animotionlab.github.io/NECromancer/

Authors:Daisuke Oba, Hiroki Furuta, Naoaki Okazaki
Title: Diffusion-State Policy Optimization for Masked Diffusion Language Models
Abstract:
Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion‑State Policy Optimization), a plug‑in credit‑assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout‑cached logits, scores the resulting completions, and updates only the newly filled tokens ‑‑ without additional multi‑step diffusion rollouts. We formalize a fixed‑state objective for branched completions and derive a policy‑gradient estimator that can be combined with terminal‑feedback policy optimization using the same rollouts. On LLaDA‑8B‑Instruct, DiSPO consistently improves over the terminal‑feedback diffu‑GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .

Authors:Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki
Title: Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
Abstract:
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed‑forward blocks for every token position at every step ‑‑ even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position ‑‑ thereafter skipping its query projection and feed‑forward sublayers ‑‑ while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per‑iteration computational cost from O(N^2d) to O(MNd) where N is the sequence length, M is the number of unlocked token positions, and d is the model dimension. In practice, M decreases as the iteration progresses, yielding substantial savings. On LLaDA‑8B, SureLock reduces algorithmic FLOPs by 30‑‑50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock .

Authors:Junqi Chen, Sirui Chen, Chaochao Lu
Title: Can Post-Training Transform LLMs into Causal Reasoners?
Abstract:
Causal inference is essential for decision‑making but remains challenging for non‑experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post‑training on these abilities is insufficiently explored. This paper examines the extent to which post‑training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post‑training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in‑domain and four existing benchmarks, our experiments demonstrate that appropriate post‑training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post‑trained LLMs exhibit strong generalization and robustness under real‑world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post‑training can produce reliable and robust LLM‑based causal reasoners. Our data and GRPO‑model are available at https://github.com/OpenCausaLab/CauGym.

Authors:Sahil Joshi, Agniva Chowdhury, Wyatt Bellinger, Amar Kanakamedala, Ekam Singh, Hoang Anh Duy Le, Aditya Desai, Anshumali Shrivastava
Title: SOCKET: SOft Collison Kernel EsTimator for Sparse Attention
Abstract:
Exploiting sparsity during long‑context inference is central to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends critically on efficient scoring and selection of relevant tokens at inference time. We revisit Locality‑Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity‑aware aggregation. Our key insight is that hard LSH produces discrete collision signals and is therefore poorly suited for ranking. In contrast, soft LSH aggregates graded collision evidence across hash tables, preserving the stability of relative ordering among the true top‑k tokens. This transformation elevates LSH from a candidate‑generation heuristic to a principled and mathematically grounded scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without ad‑hoc voting mechanism, and matches or surpasses established sparse attention baselines across multiple long‑context benchmarks using diverse set of models. With a custom CUDA kernel for scoring keys and a Flash Decode Triton backend for sparse attention, SOCKET achieves up to 1.5× higher throughput than FlashAttention, making it an effective tool for long‑context inference. Code is open‑sourced at https://github.com/amarka8/SOCKET.

Authors:Patryk Rybak, Paweł Batorski, Paul Swoboda, Przemysław Spurek
Title: REBEL: Hidden Knowledge Recovery via Evolutionary-Based Evaluation Loop
Abstract:
Machine unlearning for LLMs aims to remove sensitive or copyrighted data from trained models. However, the true efficacy of current unlearning methods remains uncertain. Standard evaluation metrics rely on benign queries that often mistake superficial information suppression for genuine knowledge removal. Such metrics fail to detect residual knowledge that more sophisticated prompting strategies could still extract. We introduce REBEL, an evolutionary approach for adversarial prompt generation designed to probe whether unlearned data can still be recovered. Our experiments demonstrate that REBEL successfully elicits ``forgotten'' knowledge from models that seemed to be forgotten in standard unlearning benchmarks, revealing that current unlearning methods may provide only a superficial layer of protection. We validate our framework on subsets of the TOFU and WMDP benchmarks, evaluating performance across a diverse suite of unlearning algorithms. Our experiments show that REBEL consistently outperforms static baselines, recovering ``forgotten'' knowledge with Attack Success Rates (ASRs) reaching up to 60% on TOFU and 93% on WMDP. We will make all code publicly available upon acceptance. Code is available at https://github.com/patryk‑rybak/REBEL/

Authors:Yu Zhang, Sean Bin Yang, Arijit Khan, Cuneyt Gurcan Akcora
Title: ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural Networks
Abstract:
Counterfactual explanations offer an intuitive way to interpret graph neural networks (GNNs) by identifying minimal changes that alter a model's prediction, thereby answering "what must differ for a different outcome?". In this work, we propose a novel framework, ATEX‑CF that unifies adversarial attack techniques with counterfactual explanation generation‑a connection made feasible by their shared goal of flipping a node's prediction, yet differing in perturbation strategy: adversarial attacks often rely on edge additions, while counterfactual methods typically use deletions. Unlike traditional approaches that treat explanation and attack separately, our method efficiently integrates both edge additions and deletions, grounded in theory, leveraging adversarial insights to explore impactful counterfactuals. In addition, by jointly optimizing fidelity, sparsity, and plausibility under a constrained perturbation budget, our method produces instance‑level explanations that are both informative and realistic. Experiments on synthetic and real‑world node classification benchmarks demonstrate that ATEX‑CF generates faithful, concise, and plausible explanations, highlighting the effectiveness of integrating adversarial insights into counterfactual reasoning for GNNs.

Authors:Peiyang Song, Pengrui Han, Noah Goodman
Title: Large Language Model Reasoning Failures
Abstract:
Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non‑embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application‑specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang‑Song/Awesome‑LLM‑Reasoning‑Failures, to provide an easy entry point to this area.

Authors:Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Leandro Giusti Mugnaini, Keith Ando Ogawa, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
Title: Compressing LLMs with MoP: Mixture of Pruners
Abstract:
The high computational demands of Large Language Models (LLMs) motivate methods that reduce parameter count and accelerate inference. In response, model pruning emerges as an effective strategy, yet current methods typically focus on a single dimension‑depth or width. We introduce MoP (Mixture of Pruners), an iterative framework that unifies these dimensions. At each iteration, MoP generates two branches‑pruning in depth versus pruning in width‑and selects a candidate to advance the path. On LLaMA‑2 and LLaMA‑3, MoP advances the frontier of structured pruning, exceeding the accuracy of competing methods across a broad set of compression regimes. It also consistently outperforms depth‑only and width‑only pruning. Furthermore, MoP translates structural pruning into real speedup, reducing end‑to‑end latency by 39% at 40% compression. Finally, extending MoP to the vision‑language model LLaVA‑1.5, we notably improve computational efficiency and demonstrate that text‑only recovery fine‑tuning can restore performance even on visual tasks.

Authors:José Ramón Pareja Monturiol, Juliette Sinnott, Roger G. Melko, Mohammad Kohandel
Title: Private and interpretable clinical prediction with quantum-inspired tensor train models
Abstract:
Machine learning in clinical settings must balance predictive accuracy, interpretability, and privacy. Models such as logistic regression (LR) offer transparency, while neural networks (NNs) provide greater predictive power; yet both remain vulnerable to privacy attacks. We empirically assess these risks by designing attacks that identify which public datasets were used to train a model under varying levels of adversarial access, applying them to LORIS, a publicly available LR model for immunotherapy response prediction, as well as to additional shallow NN models trained for the same task. Our results show that both models leak significant training‑set information, with LRs proving particularly vulnerable in white‑box scenarios. Moreover, we observe that common practices such as cross‑validation in LRs exacerbate these risks. To mitigate these vulnerabilities, we propose a quantum‑inspired defense based on tensorizing discretized models into tensor trains (TTs), which fully obfuscates parameters while preserving accuracy, reducing white‑box attacks to random guessing and degrading black‑box attacks comparably to Differential Privacy. TT models retain LR interpretability and extend it through efficient computation of marginal and conditional distributions, while also enabling this higher level of interpretability for NNs. Our results demonstrate that tensorization is widely applicable and establishes a practical foundation for private, interpretable, and effective clinical prediction.

Authors:Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang
Title: Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Abstract:
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query‑agnostic memory construction that can be inefficient and may discard query‑critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance‑cost trade‑off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query‑aware performance‑cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textscLow/\textscMid/\textscHigh). A lightweight router performs budget‑tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high‑budget setting), and delivers better accuracy‑cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade‑offs under varying budget regimes.

Authors:Xianyang Liu, Shangding Gu, Dawn Song
Title: AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions
Abstract:
Large language model (LLM)‑based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language‑mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi‑agent buyer‑seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product‑dependent valuations, and must reach agreements through multi‑round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many‑to‑many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state‑of‑the‑art proprietary and open‑weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long‑horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language‑based market interaction. Code and dataset are available at the link: https://github.com/SafeRL‑Lab/AgenticPay.

Authors:Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
Title: Layer-wise LoRA fine-tuning: a similarity metric approach
Abstract:
Pre‑training Large Language Models (LLMs) on web‑scale datasets becomes fundamental for advancing general‑purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine‑tuning. Parameter‑efficient fine‑tuning techniques, such as Low‑Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre‑trained model and updating a smaller number of parameters. In comparison to full fine‑tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address the previous problem by systematically selecting only a few layers to fine‑tune using LoRA or its variants. We argue that not all layers contribute equally to the model adaptation. Leveraging this, we identify the most relevant layers to fine‑tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low‑rank adaptation techniques. We reduce the trainable parameters in LoRA‑based techniques by up to 50%, while maintaining the predictive performance across different models and tasks. Specifically, on encoder‑only architectures, this reduction in trainable parameters leads to a negligible predictive performance drop on the GLUE benchmark. On decoder‑only architectures, we achieve a small drop or even improvements in the predictive performance on mathematical problem‑solving capabilities and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine‑tuning with LoRA modules in all layers. Code is available at: https://github.com/c2d‑usp/Layer‑wise‑LoRA‑with‑CKA

Authors:Arran Carter, Sanghyeok Choi, Kirill Tamogashev, Víctor Elvira, Nikolay Malkin
Title: Discrete diffusion samplers and bridges: Off-policy algorithms and applications in latent spaces
Abstract:
Sampling from a distribution p(x) \propto e^‑\mathcalE(x) known up to a normalising constant is an important and challenging problem in statistics. Recent years have seen the rise of a new family of amortised sampling algorithms, commonly referred to as diffusion samplers, that enable fast and efficient sampling from an unnormalised density. Such algorithms have been widely studied for continuous‑space sampling tasks; however, their application to problems in discrete space remains largely unexplored. Although some progress has been made in this area, discrete diffusion samplers do not take full advantage of ideas commonly used for continuous‑space sampling. In this paper, we propose to bridge this gap by introducing off‑policy training techniques for discrete diffusion samplers. We show that these techniques improve the performance of discrete samplers on both established and new synthetic benchmarks. Next, we generalise discrete diffusion samplers to the task of bridging between two arbitrary distributions, introducing data‑to‑energy Schrödinger bridge training for the discrete domain for the first time. Lastly, we showcase the application of the proposed diffusion samplers to data‑free posterior sampling in the discrete latent spaces of image generative models.

Authors:Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim
Title: Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching
Abstract:
Flow matching has recently emerged as a promising alternative to diffusion‑based generative models, particularly for text‑to‑image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text‑to‑image systems. Specifically, we propose learning a condition‑dependent source distribution under flow matching objective that better exploit rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text‑to‑image benchmarks demonstrate consistent and robust improvements, including up to a 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.

Authors:Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
Title: Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training
Abstract:
Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL‑regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed‑form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD‑mean, that approximates the log‑partition term with the mean reward under the sampling policy and performs regression in log‑policy space. Specifically, we characterize the population solution of PMD‑mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL‑‑χ^2 regularizer. This additional χ^2 regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite‑sample estimation errors. Experiments on math reasoning tasks show that PMD‑mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD‑mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon‑rl/OpenKimi.

Authors:András Balogh, Márk Jelasity
Title: Verification of the Implicit World Model in a Generative Model via Adversarial Sequences
Abstract:
Generative sequence models are typically trained on sample sequences from natural or formal languages. It is a crucial question whether ‑‑ or to what extent ‑‑ sample‑based training is able to capture the true structure of these languages, often referred to as the ``world model''. Theoretical results indicate that we can hope for soundness at best, that is, generating valid sequences, but not necessarily all of them. However, it is still important to have practical tools that are able to verify whether a given sequence model is sound. In this study, we focus on chess, as it is a domain that provides enough complexity while having a simple rule‑based world model. We propose adversarial sequence generation for verifying the soundness of the sequence model. Our adversaries generate valid sequences so as to force the sequence model to generate an invalid next move prediction. Apart from the falsification of soundness, this method is also suitable for a more fine‑grained analysis of the failure modes and the effects of different choices during training. To demonstrate this, we propose a number of methods for adversarial sequence generation and evaluate the approach on a large set of chess models. We train models on random as well as high‑quality chess games, using several training recipes. We find that none of the models are sound, but some training techniques and dataset choices are able to improve soundness remarkably. We also investigate the potential application of board state probes in both our training and attack methods. Our findings indicate that the extracted board states have no causal role in next token prediction in most of the models.

Authors:Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Sarro Federica, Zhaoyang Chu, He Ye
Title: ContextBench: A Benchmark for Context Retrieval in Coding Agents
Abstract:
LLM‑based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during problem solving. We introduce ContextBench, a process‑oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue‑resolution tasks from 66 repositories across eight programming languages, each augmented with human‑annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval ("The Bitter Lesson" of coding agents), LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench augments existing end‑to‑end benchmarks with intermediate gold‑context metrics that unbox the issue‑resolution process. These contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks. Data and code are available at: https://cioutn.github.io/context‑bench/.

Authors:Artem Riabinin, Andrey Veprikov, Arman Bolatov, Martin Takáč, Aleksandr Beznosikov
Title: Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers
Abstract:
We study adaptive learning rate scheduling for norm‑constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm‑up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm‑up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm‑up selection consistently outperforms or at least matches the best manually tuned warm‑up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain‑lab‑research/llm‑baselines/tree/warmup

Authors:Ling Zhan, Zhen Li, Junjie Huang, Tao Jia
Title: Accelerating Benchmarking of Functional Connectivity Modeling via Structure-aware Core-set Selection
Abstract:
Benchmarking the hundreds of functional connectivity (FC) modeling methods on large‑scale fMRI datasets is critical for reproducible neuroscience. However, the combinatorial explosion of model‑data pairings makes exhaustive evaluation computationally prohibitive, preventing such assessments from becoming a routine pre‑analysis step. To break this bottleneck, we reframe the challenge of FC benchmarking by selecting a small, representative core‑set whose sole purpose is to preserve the relative performance ranking of FC operators. We formalize this as a ranking‑preserving subset selection problem and propose Structure‑aware Contrastive Learning for Core‑set Selection (SCLCS), a self‑supervised framework to select these core‑sets. SCLCS first uses an adaptive Transformer to learn each sample's unique FC structure. It then introduces a novel Structural Perturbation Score (SPS) to quantify the stability of these learned structures during training, identifying samples that represent foundational connectivity archetypes. Finally, while SCLCS identifies stable samples via a top‑k ranking, we further introduce a density‑balanced sampling strategy as a necessary correction to promote diversity, ensuring the final core‑set is both structurally robust and distributionally representative. On the large‑scale REST‑meta‑MDD dataset, SCLCS preserves the ground‑truth model ranking with just 10% of the data, outperforming state‑of‑the‑art (SOTA) core‑set selection methods by up to 23.2% in ranking consistency (nDCG@k). To our knowledge, this is the first work to formalize core‑set selection for FC operator benchmarking, thereby making large‑scale operators comparisons a feasible and integral part of computational neuroscience. Code is publicly available on https://github.com/lzhan94swu/SCLCS

Authors:Chenxi Wan, Xunkai Li, Yilong Zuo, Haokun Deng, Sihan Li, Bowen Fan, Hongchao Qin, Ronghua Li, Guoren Wang
Title: OpenMAG: A Comprehensive Benchmark for Multimodal-Attributed Graph
Abstract:
Multimodal‑Attributed Graph (MAG) learning has achieved remarkable success in modeling complex real‑world systems by integrating graph topology with rich attributes from multiple modalities. With the rapid proliferation of novel MAG models capable of handling intricate cross‑modal semantics and structural dependencies, establishing a rigorous and unified evaluation standard has become imperative. Although existing benchmarks have facilitated initial progress, they exhibit critical limitations in domain coverage, encoder flexibility, model diversity, and task scope, presenting significant challenges to fair evaluation. To bridge this gap, we present OpenMAG, a comprehensive benchmark that integrates 19 datasets across 6 domains and incorporates 16 encoders to support both static and trainable feature encoding. OpenMAG further implements a standardized library of 24 state‑of‑the‑art models and supports 8 downstream tasks, enabling fair comparisons within a unified framework. Through systematic assessment of necessity, data quality, effectiveness, robustness, and efficiency, we derive 14 fundamental insights into MAG learning to guide future advancements. Our code is available at https://github.com/YUKI‑N810/OpenMAG.

Authors:Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi
Title: When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging
Abstract:
Model merging combines multiple fine‑tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over‑counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training‑free and data‑free post‑processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state‑of‑the‑art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: https://github.com/lyymuwu/SVC.

Authors:Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing
Title: Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
Abstract:
Large vision‑language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as \emphmisbehaviors of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine‑grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out‑of‑distribution (OOD) failures, using state‑of‑the‑art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer‑wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective. The source code is available at https://github.com/HT86159/EUQ.

Authors:Budhaditya Mukhopadhyay, Chirag Mandal, Pavan Tummala, Naghmeh Mahmoodian, Andreas Nürnberger, Soumick Chatterjee
Title: Towards Segmenting the Invisible: An End-to-End Registration and Segmentation Framework for Weakly Supervised Tumour Analysis
Abstract:
Liver tumour ablation presents a significant clinical challenge: whilst tumours are clearly visible on pre‑operative MRI, they are often effectively invisible on intra‑operative CT due to minimal contrast between pathological and healthy tissue. This work investigates the feasibility of cross‑modality weak supervision for scenarios where pathology is visible in one modality (MRI) but absent in another (CT). We present a hybrid registration‑segmentation framework that combines MSCGUNet for inter‑modal image registration with a UNet‑based segmentation module, enabling registration‑assisted pseudo‑label generation for CT images. Our evaluation on the CHAOS dataset demonstrates that the pipeline can successfully register and segment healthy liver anatomy, achieving a Dice score of 0.72. However, when applied to clinical data containing tumours, performance degrades substantially (Dice score of 0.16), revealing the fundamental limitations of current registration methods when the target pathology lacks corresponding visual features in the target modality. We analyse the "domain gap" and "feature absence" problems, demonstrating that whilst spatial propagation of labels via registration is feasible for visible structures, segmenting truly invisible pathology remains an open challenge. Our findings highlight that registration‑based label transfer cannot compensate for the absence of discriminative features in the target modality, providing important insights for future research in cross‑modality medical image analysis. Code an weights are available at: https://github.com/BudhaTronix/Weakly‑Supervised‑Tumour‑Detection

Authors:Kritchanat Ponyuenyong, Pengyu Tu, Jia Wei Tan, Wei Soon Cheong, Jamie Ng Suat Ling, Lianlian Jiang
Title: Day-Ahead Electricity Price Forecasting for Volatile Markets Using Foundation Models with Regularization Strategy
Abstract:
Electricity price forecasting (EPF) is essential for energy markets stakeholders (e.g. grid operators, energy traders, policymakers) but remains challenging due to the inherent volatility and nonlinearity of price signals. Traditional statistical and deep learning (DL) models often struggle to capture complex temporal dependencies and integrate heterogeneous data effectively. While time series foundation models (TSFMs) have shown strong performance in general time series forecasting tasks, such as traffic forecasting and weather forecasting. However, their effectiveness in day‑ahead EPF, particularly in volatile markets, remains underexplored. This paper presents a spike regularization strategy and evaluates a wide range of TSFMs, including Tiny Time Mixers (TTMs), MOIRAI, MOMENT, and TimesFM, against traditional statistical and DL models such as Autoregressive Integrated Moving Average (ARIMA), Long‑short Term Memory (LSTM), and Convolutional Neural Network ‑ LSTM (CNN‑LSTM) using half‑hourly wholesale market data with volatile trends in Singapore. Exogenous factors (e.g. weather and calendar variables) are also incorporated into models where applicable. Results demonstrate that TSFMs consistently outperform traditional approaches, achieving up to 37.4% improvement in MAPE across various evaluation settings. The findings offer practical guidance for improving forecast accuracy and decision‑making in volatile electricity markets.

Authors:Wei Soon Cheong, Lian Lian Jiang, Jamie Ng Suat Ling
Title: Assessing Electricity Demand Forecasting with Exogenous Data in Time Series Foundation Models
Abstract:
Time‑series foundation models have emerged as a new paradigm for forecasting, yet their ability to effectively leverage exogenous features ‑‑ critical for electricity demand forecasting ‑‑ remains unclear. This paper empirically evaluates foundation models capable of modeling cross‑channel correlations against a baseline LSTM with reversible instance normalization across Singaporean and Australian electricity markets at hourly and daily granularities. We systematically assess MOIRAI, MOMENT, TinyTimeMixers, ChronosX, and Chronos‑2 under three feature configurations: all features, selected features, and target‑only. Our findings reveal highly variable effectiveness: while Chronos‑2 achieves the best performance among foundation models (in zero‑shot settings), the simple baseline frequently outperforms all foundation models in Singapore's stable climate, particularly for short‑term horizons. Model architecture proves critical, with synergistic architectural implementations (TTM's channel‑mixing, Chronos‑2's grouped attention) consistently leveraging exogenous features, while other approaches show inconsistent benefits. Geographic context emerges as equally important, with foundation models demonstrating advantages primarily in variable climates. These results challenge assumptions about universal foundation model superiority and highlight the need for domain‑specific models, specifically in the energy domain.

Authors:Zolnamar Dorjsembe, Hung-Yi Chen, Furen Xiao, Hsing-Kuo Pao
Title: Parallel Swin Transformer-Enhanced 3D MRI-to-CT Synthesis for MRI-Only Radiotherapy Planning
Abstract:
MRI provides superior soft tissue contrast without ionizing radiation; however, the absence of electron density information limits its direct use for dose calculation. As a result, current radiotherapy workflows rely on combined MRI and CT acquisitions, increasing registration uncertainty and procedural complexity. Synthetic CT generation enables MRI only planning but remains challenging due to nonlinear MRI‑CT relationships and anatomical variability. We propose Parallel Swin Transformer‑Enhanced Med2Transformer, a 3D architecture that integrates convolutional encoding with dual Swin Transformer branches to model both local anatomical detail and long‑range contextual dependencies. Multi‑scale shifted window attention with hierarchical feature aggregation improves anatomical fidelity. Experiments on public and clinical datasets demonstrate higher image similarity and improved geometric accuracy compared with baseline methods. Dosimetric evaluation shows clinically acceptable performance, with a mean target dose error of 1.69%. Code is available at: https://github.com/mobaidoctor/med2transformer.

Authors:Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang
Title: CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs
Abstract:
Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out‑of‑distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) Semantic Modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping lowfrequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state‑of‑the‑art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.

Authors:Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai, Hao Peng, Bing Sun, Haiqin Weng, Liu Yan, Jianwei Yin
Title: Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions
Abstract:
Intervention‑based model steering offers a lightweight and interpretable alternative to prompting and fine‑tuning. However, by adapting strong optimization objectives from fine‑tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weak‑supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi‑directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large‑scale model steering benchmark, we show that CDAS does not always outperform preference‑optimization methods but may benefit more from increased model scale. In two safety‑related case studies, overriding refusal behaviors of safety‑aligned models and neutralizing a chain‑of‑thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference‑optimization approaches and conditionally constitutes a robust approach to intervention‑based model steering. Our code is available at https://github.com/colored‑dye/concept_das.

Authors:Changhoon Song, Teng Yuan Chang, Youngjoon Hong
Title: Extreme Weather Nowcasting via Local Precipitation Pattern Prediction
Abstract:
Accurate forecasting of extreme weather events such as heavy rainfall or storms is critical for risk management and disaster mitigation. Although high‑resolution radar observations have spurred extensive research on nowcasting models, precipitation nowcasting remains particularly challenging due to pronounced spatial locality, intricate fine‑scale rainfall structures, and variability in forecasting horizons. While recent diffusion‑based generative ensembles show promising results, they are computationally expensive and unsuitable for real‑time applications. In contrast, deterministic models are computationally efficient but remain biased toward normal rainfall. Furthermore, the benchmark datasets commonly used in prior studies are themselves skewed‑‑either dominated by ordinary rainfall events or restricted to extreme rainfall episodes‑‑thereby hindering general applicability in real‑world settings. In this paper, we propose exPreCast, an efficient deterministic framework for generating finely detailed radar forecasts, and introduce a newly constructed balanced radar dataset from the Korea Meteorological Administration (KMA), which encompasses both ordinary precipitation and extreme events. Our model integrates local spatiotemporal attention, a texture‑preserving cubic dual upsampling decoder, and a temporal extractor to flexibly adjust forecasting horizons. Experiments on established benchmarks (SEVIR and MeteoNet) as well as on the balanced KMA dataset demonstrate that our approach achieves state‑of‑the‑art performance, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.

Authors:Magesh Rajasekaran, Md Saiful Sajol, Chris Alvin, Supratik Mukhopadhyay, Yanda Ou, Z. George Xue
Title: Benchmarking Artificial Intelligence Models for Daily Coastal Hypoxia Forecasting
Abstract:
Coastal hypoxia, especially in the northern part of Gulf of Mexico, presents a persistent ecological and economic concern. Seasonal models offer coarse forecasts that miss the fine‑scale variability needed for daily, responsive ecosystem management. We present study that compares four deep learning architectures for daily hypoxia classification: Bidirectional Long Short‑Term Memory (BiLSTM), Medformer (Medical Transformer), Spatio‑Temporal Transformer (ST‑Transformer), and Temporal Convolutional Network (TCN). We trained our models with twelve years of daily hindcast data from 2009‑2020 Our training data consists of 2009‑2020 hindcast data from a coupled hydrodynamic‑biogeochemical model. Similarly, we use hindcast data from 2020 through 2024 as a test data. We constructed classification models incorporating water column stratification, sediment oxygen consumption, and temperature‑dependent decomposition rates. We evaluated each architectures using the same data preprocessing, input/output formulation, and validation protocols. Each model achieved high classification accuracy and strong discriminative ability with ST‑Transformer achieving the highest performance across all metrics and tests periods (AUC‑ROC: 0.982‑0.992). We also employed McNemar's method to identify statistically significant differences in model predictions. Our contribution is a reproducible framework for operational real‑time hypoxia prediction that can support broader efforts in the environmental and ocean modeling systems community and in ecosystem resilience. The source code is available https://github.com/rmagesh148/hypoxia‑ai/

Authors:Olga Ovcharenko, Matthias Boehm, Sebastian Schelter
Title: SemPipes -- Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines
Abstract:
Real‑world machine learning on tabular data relies on complex data preparation pipelines for prediction, data integration, augmentation, and debugging. Designing these pipelines requires substantial domain expertise and engineering effort, motivating the question of how large language models (LLMs) can support tabular ML through code synthesis. We introduce SemPipes, a novel declarative programming model that integrates LLM‑powered semantic data operators into tabular ML pipelines. Semantic operators specify data transformations in natural language while delegating execution to a runtime system. During training, SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. This design enables the automatic optimization of data operations in a pipeline via LLM‑based code synthesis guided by evolutionary search. We evaluate SemPipes across diverse tabular ML tasks and show that semantic operators substantially improve end‑to‑end predictive performance for both expert‑designed and agent‑generated pipelines, while reducing pipeline complexity. We implement SemPipes in Python and release it at https://github.com/deem‑data/sempipes/tree/v1.

Authors:Abdul Joseph Fofanah, Lian Wen, David Chen, Alpha Alimamy Kamara, Zhongyi Zhang
Title: CAST-CKT: Chaos-Aware Spatio-Temporal and Cross-City Knowledge Transfer for Traffic Flow Prediction
Abstract:
Traffic prediction in data‑scarce, cross‑city settings is challenging due to complex nonlinear dynamics and domain shifts. Existing methods often fail to capture traffic's inherent chaotic nature for effective few‑shot learning. We propose CAST‑CKT, a novel Chaos‑Aware Spatio‑Temporal and Cross‑City Knowledge Transfer framework. It employs an efficient chaotic analyser to quantify traffic predictability regimes, driving several key innovations: chaos‑aware attention for regime‑adaptive temporal modelling; adaptive topology learning for dynamic spatial dependencies; and chaotic consistency‑based cross‑city alignment for knowledge transfer. The framework also provides horizon‑specific predictions with uncertainty quantification. Theoretical analysis shows improved generalisation bounds. Extensive experiments on four benchmarks in cross‑city few‑shot settings show CAST‑CKT outperforms state‑of‑the‑art methods by significant margins in MAE and RMSE, while offering interpretable regime analysis. Code is available at https://github.com/afofanah/CAST‑CKT.

Authors:Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci
Title: Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization
Abstract:
Selecting the best data mixture is critical for successful Supervised Fine‑Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain‑specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so‑called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain‑specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain‑specific multimodal experts and evaluate their weighted parameter‑space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource‑intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.

Authors:Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
Title: TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation
Abstract:
The rapid growth of large language models (LLMs) has heightened the importance of post‑training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion‑scale LLMs to be quantized within a few GPU hours. However, GPTQ's assumption of layer‑wise independence leads to severe accuracy drops in low‑bit regimes. Recently, BoA improved upon GPTQ by incorporating inter‑layer dependencies within attention modules, but its reliance on sequential quantization across all out‑channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation‑free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. The proposed TurboBoA introduces three key innovations: (i) joint quantization of multiple out‑channels with a closed‑form error compensation rule, which reduces sequential bottlenecks and yields more than a three‑fold speedup; (ii) a correction mechanism for errors propagated from preceding quantized layers; and (iii) adaptive grid computation with coordinate descent refinement to maintain alignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial acceleration over BoA while consistently improving accuracy. When combined with outlier suppression techniques, it achieves state‑of‑the‑art results in both weight‑only and weight‑activation quantization. The code will be available at https://github.com/SamsungLabs/TurboBoA.

Authors:Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, Congcong Miao
Title: Internalizing LLM Reasoning via Discovery and Replay of Latent Actions
Abstract:
The internalization of chain‑of‑thought processes into hidden states has emerged as a highly efficient paradigm for scaling test‑time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non‑stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self‑Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three‑stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value‑modulated trajectory intervention dynamically injects context‑specific impulses via anchor‑based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain‑of‑thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass the explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/LLM‑Latent‑Action.

Authors:Dinh Phu Tran, Jihoon Jeong, Saad Wazir, Seongah Kim, Thao Do, Cem Subakan, Daeyoung Kim
Title: Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering
Abstract:
We present a formal problem formulation for Reliable Audio‑Visual Question Answering (\mathcalR‑AVQA), where we prefer abstention over answering incorrectly. While recent AVQA models have high accuracy, their ability to identify when they are likely wrong and their consequent abstention from answering remain underexplored areas of research. To fill this gap, we explore several approaches and then propose Adaptive Confidence Refinement (ACR), a lightweight method to further enhance the performance of \mathcalR‑AVQA. Our key insight is that the Maximum Softmax Probability (MSP) is Bayes‑optimal only under strong calibration, a condition usually not met in deep neural networks, particularly in multimodal models. Instead of replacing MSP, our ACR maintains it as a primary confidence signal and applies input‑adaptive residual corrections when MSP is deemed unreliable. ACR introduces two learned heads: i) a Residual Risk Head that predicts low‑magnitude correctness residuals that MSP does not capture, and ii) a Confidence Gating Head to determine MSP trustworthiness. Our experiments and theoretical analysis show that ACR consistently outperforms existing methods on in‑ and out‑of‑disrtibution, and data bias settings across three different AVQA architectures, establishing a solid foundation for \mathcalR‑AVQA task. The code and checkpoints will be available upon acceptance \hrefhttps://github.com/PhuTran1005/R‑AVQAat here

Authors:Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren
Title: A Causal Perspective for Enhancing Jailbreak Attack and Defense
Abstract:
Uncovering the mechanisms behind "jailbreaks" in large language models (LLMs) is crucial for enhancing their safety and reliability, yet these mechanisms remain poorly understood. Existing studies predominantly analyze jailbreak prompts by probing latent representations, often overlooking the causal relationships between interpretable prompt features and jailbreak occurrences. In this work, we propose Causal Analyst, a framework that integrates LLMs into data‑driven causal discovery to identify the direct causes of jailbreaks and leverage them for both attack and defense. We introduce a comprehensive dataset comprising 35k jailbreak attempts across seven LLMs, systematically constructed from 100 attack templates and 50 harmful queries, annotated with 37 meticulously designed human‑readable prompt features. By jointly training LLM‑based prompt encoding and GNN‑based causal graph learning, we reconstruct causal pathways linking prompt features to jailbreak responses. Our analysis reveals that specific features, such as "Positive Character" and "Number of Task Steps", act as direct causal drivers of jailbreaks. We demonstrate the practical utility of these insights through two applications: (1) a Jailbreaking Enhancer that targets identified causal features to significantly boost attack success rates on public benchmarks, and (2) a Guardrail Advisor that utilizes the learned causal graph to extract true malicious intent from obfuscated queries. Extensive experiments, including baseline comparisons and causal structure validation, confirm the robustness of our causal analysis and its superiority over non‑causal approaches. Our results suggest that analyzing jailbreak features from a causal perspective is an effective and interpretable approach for improving LLM reliability. Our code is available at https://github.com/Master‑PLC/Causal‑Analyst.

Authors:Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, Nika Haghtalab
Title: Subliminal Effects in Your Data: A General Mechanism via Log-Linearity
Abstract:
Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset‑centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit‑Linear‑Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real‑world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.

Authors:Philipp Nazari, T. Konstantin Rusch
Title: The Key to State Reduction in Linear Attention: A Rank-based Perspective
Abstract:
Linear attention offers a computationally efficient yet expressive alternative to softmax attention. However, recent empirical results indicate that the state of trained linear attention models often exhibits a low‑rank structure, suggesting that these models underexploit their capacity in practice. To illuminate this phenomenon, we provide a theoretical analysis of the role of rank in linear attention, revealing that low effective rank can affect retrieval error by amplifying query noise. In addition to these theoretical insights, we conjecture that the low‑rank states can be substantially reduced post‑training with only minimal performance degradation, yielding faster and more memory‑efficient models. To this end, we propose a novel hardware‑aware approach that structurally prunes key and query matrices, reducing the state size while retaining compatibility with existing CUDA kernels. We adapt several existing pruning strategies to fit our framework and, building on our theoretical analysis, propose a novel structured pruning method based on a rank‑revealing QR decomposition. Our empirical results, evaluated across models of varying sizes and on various downstream tasks, demonstrate the effectiveness of our state reduction framework. We highlight that our framework enables the removal of 50% of the query and key channels at only a marginal increase in perplexity. The code for this project can be found at https://github.com/camail‑official/LinearAttentionPruning.

Authors:Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu, Zhiyuan Liu, Maosong Sun
Title: SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization
Abstract:
True self‑evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where ``new'' knowledge may appear in pre‑training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE‑Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo‑novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open‑Book Paradox, where training with reference documentation inhibits retention, requiring "Closed‑Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self‑Play for internalization, proving models can learn from self‑generated, noisy tasks when coupled with SFT, but not RL. Overall, SE‑Bench establishes a rigorous diagnostic platform for self‑evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE‑Bench.

Authors:Kieran A. Murphy
Title: From independent patches to coordinated attention: Controlling information flow in vision transformers
Abstract:
We make the information transmitted by attention an explicit, measurable quantity in vision transformers. By inserting variational information bottlenecks on all attention‑mediated writes to the residual stream ‑‑ without other architectural changes ‑‑ we train models with an explicit information cost and obtain a controllable spectrum from independent patch processing to fully expressive global attention. On ImageNet‑100, we characterize how classification behavior and information routing evolve across this spectrum, and provide initial insights into how global visual representations emerge from local patch processing by analyzing the first attention heads that transmit information. By biasing learning toward solutions with constrained internal communication, our approach yields models that are more tractable for mechanistic analysis and more amenable to control.

Authors:Mingyang Deng, He Li, Tianhong Li, Yilun Du, Kaiming He
Title: Generative Modeling via Drifting
Abstract:
Generative modeling can be formulated as learning a mapping f such that its pushforward distribution matches the data distribution. The pushforward behavior can be carried out iteratively at inference time, for example in diffusion and flow‑based models. In this paper, we propose a new paradigm called Drifting Models, which evolve the pushforward distribution during training and naturally admit one‑step inference. We introduce a drifting field that governs the sample movement and achieves equilibrium when the distributions match. This leads to a training objective that allows the neural network optimizer to evolve the distribution. In experiments, our one‑step generator achieves state‑of‑the‑art results on ImageNet at 256 x 256 resolution, with an FID of 1.54 in latent space and 1.61 in pixel space. We hope that our work opens up new opportunities for high‑quality one‑step generation.

Authors:Yan Chen, Jie Peng, Moajjem Hossain Chowdhury, Tianlong Chen, Yunmei Liu
Title: NeuroCanvas: VLLM-Powered Robust Seizure Detection by Reformulating Multichannel EEG as Image
Abstract:
Accurate and timely seizure detection from Electroencephalography (EEG) is critical for clinical intervention, yet manual review of long‑term recordings is labor‑intensive. Recent efforts to encode EEG signals into large language models (LLMs) show promise in handling neural signals across diverse patients, but two significant challenges remain: (1) multi‑channel heterogeneity, as seizure‑relevant information varies substantially across EEG channels, and (2) computing inefficiency, as the EEG signals need to be encoded into a massive number of tokens for the prediction. To address these issues, we draw the EEG signal and propose the novel NeuroCanvas framework. Specifically, NeuroCanvas consists of two modules: (i) The Entropy‑guided Channel Selector (ECS) selects the seizure‑relevant channels input to LLM and (ii) the following Canvas of Neuron Signal (CNS) converts selected multi‑channel heterogeneous EEG signals into structured visual representations. The ECS module alleviates the multi‑channel heterogeneity issue, and the CNS uses compact visual tokens to represent the EEG signals that improve the computing efficiency. We evaluate NeuroCanvas across multiple seizure detection datasets, demonstrating a significant improvement of 20% in F1 score and reductions of 88% in inference latency. These results highlight NeuroCanvas as a scalable and effective solution for real‑time and resource‑efficient seizure detection in clinical practice.The code will be released at https://github.com/Yanchen30247/seizure_detect.

Authors:Kejiang Qian, Amos Storkey, Fengxiang He
Title: Rationality Measurement and Theory for Reinforcement Learning Agents
Abstract:
This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy of a policy's actions against their rational counterparts, culminating over the trajectory in deployment, is defined to be expected rational risk; an empirical average version in training is also defined. Their difference, termed as rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic one due to the algorithm's generalisability in a dynamic environment. They are upper bounded by, respectively, (1) the 1‑Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits from regularisers (including layer normalisation, \ell_2 regularisation, and weight normalisation) and domain randomisation, as well as the harm from environment shifts. Experiments are in full agreement with these hypotheses. The code is available at https://github.com/EVIEHub/Rationality.

Authors:Thatchanon Anancharoenkij, Donlapark Ponnoprat
Title: Conditional Counterfactual Mean Embeddings: Doubly Robust Estimation and Learning Rates
Abstract:
A complete understanding of heterogeneous treatment effects involves characterizing the full conditional distribution of potential outcomes. To this end, we propose the Conditional Counterfactual Mean Embeddings (CCME), a framework that embeds conditional distributions of counterfactual outcomes into a reproducing kernel Hilbert space (RKHS). Under this framework, we develop a two‑stage meta‑estimator for CCME that accommodates any RKHS‑valued regression in each stage. Based on this meta‑estimator, we develop three practical CCME estimators: (1) Ridge Regression estimator, (2) Deep Feature estimator that parameterizes the feature map by a neural network, and (3) Neural‑Kernel estimator that performs RKHS‑valued regression, with the coefficients parameterized by a neural network. We provide finite‑sample convergence rates for all estimators, establishing that they possess the double robustness property. Our experiments demonstrate that our estimators accurately recover distributional features including multimodal structure of conditional counterfactual distributions.

Authors:Moritz Miller, Florent Draye, Bernhard Schölkopf
Title: Identifying Intervenable and Interpretable Features via Orthogonality Regularization
Abstract:
With recent progress on fine‑tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the Independent Causal Mechanisms principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under \texttthttps://github.com/mrtzmllr/sae‑icm.

Authors:Xianbiao Qi, Marco Chen, Jiaquan Ye, Yelin He, Rong Xiao
Title: Delving into Muon and Beyond: Deep Analysis and Extensions
Abstract:
The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrix‑shaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the p = 0 endpoint of a family of spectral transformations of the form U \boldsymbolΣ^p V' , and consider additional variants with p = 1/2 , p = 1/4 , and p = 1 . These transformations are applied to both first‑moment updates, as in momentum SGD, and to root‑mean‑square (RMS) normalized gradient updates as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS‑normalized updates yield more stable optimization than first‑moment updates. Moreover, while spectral compression provides strong stabilization benefits under first‑moment updates, the Muon update (p = 0) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method. Our source code will be released at https://github.com/Ocram7/BeyondMuon.

Authors:Dipan Maity
Title: SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF
Abstract:
Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. PPO performs well empirically but has a heuristic motivation and handles the KL‑divergence constraint used in LM‑RLHF in an ad‑hoc manner and suffers form reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new pure on policy actor‑critic RL method for the LM‑RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy‑aware control),a novel RLHF algorithm that combines a Double Soft‑Min Critic for pessimistic value estimation with a new multi‑layer stabilization framework combining entropy‑gated KL regulation, and PID‑controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high‑entropy exploration from low‑entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B parameter model show SAFE achieves +5.15% training‑average reward than PPO (0.725 vs 0.689), negligible reward crashes, and superior KL control than ppo . Our method adds minimal computational overhead and provides an interpretable, crash‑resistant RLHF framework that maintains aggressive learning speed while ensuring stable long‑horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE

Authors:Sandra Fortini, Kenyon Ng, Sonia Petrone, Judith Rousseau, Susan Wei
Title: A principled framework for uncertainty decomposition in TabPFN
Abstract:
TabPFN is a transformer that achieves state‑of‑the‑art performance on supervised tabular tasks by amortizing Bayesian prediction into a single forward pass. However, there is currently no method for uncertainty decomposition in TabPFN. Because it behaves, in an idealised limit, as a Bayesian in‑context learner, we cast the decomposition challenge as a Bayesian predictive inference (BPI) problem. The main computational tool in BPI, predictive Monte Carlo, is challenging to apply here as it requires simulating unmodeled covariates. We therefore pursue the asymptotic alternative, filling a gap in the theory for supervised settings by proving a predictive CLT under quasi‑martingale conditions. We derive variance estimators determined by the volatility of predictive updates along the context. The resulting credible bands are fast to compute, target epistemic uncertainty, and achieve near‑nominal frequentist coverage. For classification, we further obtain an entropy‑based uncertainty decomposition.

Authors:Lunjun Zhang, Jimmy Ba
Title: EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL
Abstract:
Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q‑learning. Second, we introduce Top‑k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using EMA anchor; moreover, we show that our Top‑k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA‑PG) lead to a significant performance boost. On math reasoning, it allows R1‑distilled Qwen‑1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO. On agentic RL domains, with Qwen‑3B base, EMA‑PG improves GRPO by an average of 33.3% across 7 datasets of Q&A with search engines, including 29.7% \rightarrow 44.1% on HotpotQA, 27.4% \rightarrow 40.1% on 2WikiMultiHopQA. Overall, we show that EMA‑PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: https://github.com/LunjunZhang/ema‑pg

Authors:Puyue Wang, Jiawei Hu, Yan Gao, Junyan Wang, Yu Zhang, Gillian Dobbie, Tao Gu, Wafa Johal, Ting Dang, Hong Jia
Title: HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation
Abstract:
Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two‑stage learning framework for robust humanoid control under domain shift. First, we train a high‑performance teacher policy via history‑conditioned reinforcement learning, where the policy infers latent dynamics context from recent state‑‑action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher's robust control capabilities into a transformer‑based student policy that operates on sparse root‑relative 3D joint keypoint trajectories. By combining history‑conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero‑shot to unseen domains without per‑domain retraining. Extensive experiments show HoRD outperforms strong baselines in robustness and transfer, especially under unseen domains and external perturbations. Code and project page are available at https://tonywang‑0517.github.io/hord/.

Authors:Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng
Title: SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
Abstract:
Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next‑scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high‑resolution scales, which speeds up inference but discards high‑frequency details and harms image quality. To address these problems, we present SparVAR, a training‑free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross‑scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high‑resolution scales from a sparse decision scale, and construct scale self‑similar sparse attention via an efficient index‑mapping mechanism, enabling high‑efficiency sparse attention computation at large scales. Furthermore, we propose cross‑scale local sparse attention and implement an efficient block‑wise sparse kernel, which achieves \mathbf> 5× faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparseVAR can reduce the generation time of an 8B model producing 1024×1024 high‑resolution images to the 1s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a \mathbf1.57× speed‑up while preserving almost all high‑frequency details. When combined with existing scale‑skipping strategies, SparseVAR attains up to a \mathbf2.28× acceleration, while maintaining competitive visual generation quality. Code is available at https://github.com/CAS‑CLab/SparVAR.

Authors:Yansong Ning, Jun Fang, Naiqiang Tan, Hao Liu
Title: Agent-Omit: Training Efficient LLM Agents for Adaptive Thought and Observation Omission via Agentic Reinforcement Learning
Abstract:
Managing agent thought and observation during multi‑turn agent‑environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat the entire interaction trajectories equally, overlooking the thought necessity and observation utility varies across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent‑Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold‑start data, including both single‑turn and multi‑turn omission scenarios, to fine‑tune the agent for omission behaviors. Furthermore, we introduce an omit‑aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent's adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper‑bounded by KL‑divergence. Experimental results on five agent benchmarks show that our constructed Agent‑Omit‑8B could obtain performance comparable to seven frontier LLM agent, and achieve the best effectiveness‑efficiency trade‑off than seven efficient LLM agents methods. Our code and data are available at https://github.com/usail‑hkust/Agent‑Omit.

Authors:Suzeyu Chen, Leheng Li, Ying-Cong Chen
Title: SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction
Abstract:
Achieving highly accurate and real‑time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While this shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non‑uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype‑based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two‑stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype‑guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground‑truth masks to provide explicit guidance, guaranteeing a consistent query‑prototype association across decoder layers. Our model, dubbed SPOT‑Occ, outperforms previous methods with a significant margin in speed while also improving accuracy. Source code is released at https://github.com/chensuzeyu/SpotOcc.

Authors:Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu
Title: RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
Abstract:
Large Reasoning Models (LRMs) have achieved tremendous success with their chain‑of‑thought (CoT) reasoning, yet also face safety issues similar to those of basic language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the generalization of the safe reasoning process, particularly their insufficiency against complex attack prompts. We provide both theoretical and empirical evidence to show the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk‑Aware Preference Optimization (RAPO) framework that enables LRM to adaptively identify and address the safety risks with appropriate granularity in its thinking content. Extensive experiments demonstrate that RAPO successfully generalizes multiple LRMs' safe reasoning adaptively across diverse attack prompts whilst preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at https://github.com/weizeming/RAPO.

Authors:Angel Martinez-Sanchez, Parthib Roy, Ross Greer
Title: Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models
Abstract:
Instruction‑grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction‑following planners rely on simulation or fixed command vocabularies, limiting real‑world generalization. doScenes, the first real‑world dataset linking free‑form instructions (with referentiality) to nuScenes ground‑truth motion, enables instruction‑conditioned planning. In this work, we adapt OpenEMMA, an open‑source MLLM‑based end‑to‑end driving framework that ingests front‑camera views and ego‑state and outputs 10‑step speed‑curvature trajectories, to this setting, presenting a reproducible instruction‑conditioned baseline on doScenes and investigate the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger‑style prompts within OpenEMMA's vision‑language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using ADE, we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well‑phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a "good" instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction‑aware planning. GitHub: https://github.com/Mi3‑Lab/doScenes‑VLM‑Planning

Authors:Dhruv S. Kushwaha, Zoleikha A. Biron
Title: Lyapunov Constrained Soft Actor-Critic (LC-SAC) using Koopman Operator Theory for Quadrotor Trajectory Tracking
Abstract:
Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision‑making problems. However, its application to safety‑critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that may induce oscillations or unbounded state divergence. There has significant work in incorporating Lyapunov‑based stability guarantees in RL algorithms with key challenges being selecting a candidate Lyapunov function, computational complexity by using excessive function approximators and conservative policies by incorporating stability criterion in the learning process. In this work we propose a novel Lyapunov‑constrained Soft Actor‑Critic (LC‑SAC) algorithm using Koopman operator theory. We propose use of extended dynamic mode decomposition (EDMD) to produce a linear approximation of the system and use this approximation to derive a closed form solution for candidate Lyapunov function. This derived Lyapunov function is incorporated in the SAC algorithm to further provide guarantees for a policy that stabilizes the nonlinear system. The results are evaluated trajectory tracking of a 2D Quadrotor environment based on safe‑control‑gym. The proposed algorithm shows training convergence and decaying violations for Lyapunov stability criterion compared to baseline vanilla SAC algorithm. GitHub Repository: https://github.com/DhruvKushwaha/LC‑SAC‑Quadrotor‑Trajectory‑Tracking

Authors:Xiaofeng Lin, Sirou Zhu, Yilei Chen, Mingyu Chen, Hejian Sang, Ioannis Paschalidis, Zhipeng Wang, Aldo Pacchiano, Xuezhou Zhang
Title: Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL
Abstract:
Large language models (LLMs) achieve strong performance when all task‑relevant information is available upfront, as in static prediction and instruction‑following problems. However, many real‑world decision‑making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in‑context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in‑context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi‑task, multi‑episode meta‑reinforcement learning framework that trains LLMs to learn from interaction in context. After meta‑training, a relatively small open‑source model (Qwen3‑14B) demonstrates substantially improved in‑context online learning on entirely unseen environments, matching the performance of GPT‑5.2 and outperforming standard RL fine‑tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn‑at‑inference‑time decision‑making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.

Authors:Felipe Angelim, Alessandro Leite
Title: Partition Trees: Conditional Density Estimation over General Outcome Spaces
Abstract:
We propose Partition Trees, a tree‑based framework for conditional density estimation over general outcome spaces, supporting both continuous and categorical variables within a unified formulation. Our approach models conditional distributions as piecewise‑constant densities on data adaptive partitions and learns trees by directly minimizing conditional negative log‑likelihood. This yields a scalable, nonparametric alternative to existing probabilistic trees that does not make parametric assumptions about the target distribution. We further introduce Partition Forests, an ensemble extension obtained by averaging conditional densities. Empirically, we demonstrate improved probabilistic prediction over CART‑style trees and competitive or superior performance compared to state‑of‑the‑art probabilistic tree methods and Random Forests, along with robustness to redundant features and heteroscedastic noise.

Authors:Pengcheng Wang, Qinghang Liu, Haotian Lin, Yiheng Li, Guojian Zhan, Masayoshi Tomizuka, Yixiao Wang
Title: DADP: Domain Adaptive Diffusion Policy
Abstract:
Learning domain adaptive policies that can generalize to unseen transition dynamics, remains a fundamental challenge in learning‑based control. Substantial progress has been made through domain representation learning to capture domain‑specific information, thus enabling domain‑aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such mixture can confuse the conditioned policy, thereby constraining zero‑shot adaptation. To tackle the challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain‑aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historical offset context; by increasing this temporal gap, we unsupervisedly disentangle static domain representations by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance, and the generalizability of DADP over prior methods. More visualization results are available on the https://outsider86.github.io/DomainAdaptiveDiffusionPolicy/.

Authors:Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec
Title: PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models
Abstract:
Relational Foundation Models (RFMs) facilitate data‑driven decision‑making by learning from complex multi‑table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary‑‑foreign key connectivity for multi‑table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi‑tabular relational databases from scratch. In a step‑by‑step fashion, PluRel models (1) schemas with directed graphs, (2) inter‑table primary‑foreign key connectivity with bipartite graphs, and, (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases, while being computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power‑law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.

Authors:Jusheng Zhang, Ningyuan Liu, Qinhan Lyu, Jing Yang, Keze Wang
Title: Rational ANOVA Networks
Abstract:
Deep neural networks typically treat nonlinearities as fixed primitives (e.g., ReLU), limiting both interpretability and the granularity of control over the induced function class. While recent additive models (like KANs) attempt to address this using splines, they often suffer from computational inefficiency and boundary instability. We propose the Rational‑ANOVA Network (RAN), a foundational architecture grounded in functional ANOVA decomposition and Padé‑style rational approximation. RAN models f(x) as a composition of main effects and sparse pairwise interactions, where each component is parameterized by a stable, learnable rational unit. Crucially, we enforce a strictly positive denominator, which avoids poles and numerical instability while capturing sharp transitions and near‑singular behaviors more efficiently than polynomial bases. This ANOVA structure provides an explicit low‑order interaction bias for data efficiency and interpretability, while the rational parameterization significantly improves extrapolation. Across controlled function benchmarks and vision classification tasks (e.g., CIFAR‑10) under matched parameter and compute budgets, RAN matches or surpasses parameter‑matched MLPs and learnable‑activation baselines, with better stability and throughput. Code is available at https://github.com/jushengzhang/Rational‑ANOVA‑Networks.git.

Authors:Aijie Shu, Wenbin Wu, Gbenga Ibikunle, Fengxiang He
Title: DeXposure-FM: A Time-series, Graph Foundation Model for Credit Exposures and Stability on Decentralized Financial Networks
Abstract:
Credit exposure in Decentralized Finance (DeFi) is often implicit and token‑mediated, creating a dense web of inter‑protocol dependencies. Thus, a shock to one token may result in significant and uncontrolled contagion effects. As the DeFi ecosystem becomes increasingly linked with traditional financial infrastructure through instruments, such as stablecoins, the risk posed by this dynamic demands more powerful quantification tools. We introduce DeXposure‑FM, the first time‑series, graph foundation model for measuring and forecasting inter‑protocol credit exposure on DeFi networks, to the best of our knowledge. Employing a graph‑tabular encoder, with pre‑trained weight initialization, and multiple task‑specific heads, DeXposure‑FM is trained on the DeXposure dataset that has 43.7 million data entries, across 4,300+ protocols on 602 blockchains, covering 24,300+ unique tokens. The training is operationalized for credit‑exposure forecasting, predicting the joint dynamics of (1) protocol‑level flows, and (2) the topology and weights of credit‑exposure links. The DeXposure‑FM is empirically validated on two machine learning benchmarks; it consistently outperforms the state‑of‑the‑art approaches, including a graph foundation model and temporal graph neural networks. DeXposure‑FM further produces financial economics tools that support macroprudential monitoring and scenario‑based DeFi stress testing, by enabling protocol‑level systemic‑importance scores, sector‑level spillover and concentration measures via a forecast‑then‑measure pipeline. Empirical verification fully supports our financial economics tools. The model and code have been publicly available. Model: https://huggingface.co/EVIEHub/DeXposure‑FM. Code: https://github.com/EVIEHub/DeXposure‑FM.

Authors:Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez
Title: SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Abstract:
Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision‑language models (VLMs). Prior work largely relied on synthetic or LLM‑generated environments with limited task designs and puzzle‑like setups, failing to capture the real‑world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question‑answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple‑choice and open‑ended evaluation. Experiments across diverse state‑of‑the‑art VLMs, including open‑ and closed‑source models, reasoning‑focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple‑choice setup, InternVL3.5‑72B achieves 54.93% accuracy versus 87.57% for humans. In the open‑ended setting, all models show a performance drop of around 10‑25%, with GPT‑5‑mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real‑world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human‑aligned spatial understanding. SpatiaLab is available at: https://spatialab‑reasoning.github.io/.

Authors:Michael Ibrahim, Hanqi Zhao, Eli Sennesh, Zhi Li, Anqi Wu, Jacob L. Yates, Chengrui Li, Hadi Vafaii
Title: A Hitchhiker's Guide to Poisson Gradient Estimation
Abstract:
Poisson‑distributed latent variable models are widely used in computational neuroscience, but differentiating through discrete stochastic samples remains challenging. Two approaches address this: Exponential Arrival Time (EAT) simulation and Gumbel‑SoftMax (GSM) relaxation. We provide the first systematic comparison of these methods, along with practical guidance for practitioners. Our main technical contribution is a modification to the EAT method that theoretically guarantees an unbiased first moment (exactly matching the firing rate), and reduces second‑moment bias. We evaluate these methods on their distributional fidelity, gradient quality, and performance on two tasks: (1) variational autoencoders with Poisson latents, and (2) partially observable generalized linear models, where latent neural connectivity must be inferred from observed spike trains. Across all metrics, our modified EAT method exhibits better overall performance (often comparable to exact gradients), and substantially higher robustness to hyperparameter choices. Together, our results clarify the trade‑offs between these methods and offer concrete recommendations for practitioners working with Poisson latent variable models.

Authors:Jinxing Zhou, Yanghao Zhou, Yaoting Wang, Zongyan Han, Jiaqi Ma, Henghui Ding, Rao Muhammad Anwer, Hisham Cholakkal
Title: Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
Abstract:
Language‑referred audio‑visual segmentation (Ref‑AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref‑AVS context (MQA‑RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground‑truth annotations as references at inference time. Given audio‑visual‑language inputs and each provided segmentation mask, the task requires estimating its IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality‑control decision. To support this task, we construct MQ‑RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ‑Auditor, a multimodal large language model (MLLM)‑based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ‑Auditor outperforms strong open‑source and commercial MLLMs and can be integrated with existing Ref‑AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and codes will be released at https://github.com/jasongief/MQA‑RefAVS.

Authors:Romain Cosentino
Title: PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning
Abstract:
We develop a continual learning method for pretrained models that \emphrequires no access to old‑task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial \emphgeometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining‑era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for \emphwhere to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old‑data distribution and improved worst‑case retention guarantees. These insights lead to \textscPLATE (Plasticity‑Tunable Efficient Adapters), a continual learning method requiring no past‑task data that provides explicit control over the plasticity‑retention trade‑off. PLATE parameterizes each layer with a structured low‑rank update ΔW = B A Q^\top, where B and Q are computed once from pretrained weights and kept frozen, and only A is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.

Authors:Dingkun Zhang, Shuhan Qi, Yulin Wu, Xinyu Xiao, Xuan Wang, Long Chen
Title: Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning
Abstract:
Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency issue, which is associated with their massive model sizes and visual token numbers. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we are exploring another substantial research direction for efficient training by reducing visual tokens. However, applying VTP at the training stage results in a training‑inference mismatch: pruning‑trained models perform poorly when inferring on non‑pruned full visual token sequences. To close this gap, we propose DualSpeed, a fast‑slow framework for efficient training of MLLMs. The fast‑mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model's behaviors. The slow‑mode is the auxiliary mode, where the model is trained on full visual sequences to retain training‑inference consistency. To boost its training, it further leverages self‑distillation to learn from the sufficiently trained fast‑mode. Together, DualSpeed can achieve both training efficiency and non‑degraded performance. Experiments show DualSpeed accelerates the training of LLaVA‑1.5 by 2.1× and LLaVA‑NeXT by 4.0×, retaining over 99% performance. Code: https://github.com/dingkun‑zhang/DualSpeed

Authors:Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun
Title: Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation
Abstract:
Recently, there have been significant research interests in training large language models (LLMs) with reinforcement learning (RL) on real‑world tasks, such as multi‑turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinders wide adoption. In this paper, we build on the observation that multi‑turn code generation can be formulated as a one‑step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single‑step code generation. Cobalt outperforms two multi‑turn online RL baselines based on GRPO and VeRPO, and substantially improves R1‑Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs' in‑context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision‑making tasks like multi‑turn code generation. Our code and data are available at https://github.com/OSU‑NLP‑Group/cobalt.

Authors:Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, Shangding Gu
Title: Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity
Abstract:
LLM‑based multi‑agent systems (MAS) have emerged as a promising approach to tackle complex tasks that are difficult for individual LLMs. A natural strategy is to scale performance by increasing the number of agents; however, we find that such scaling exhibits strong diminishing returns in homogeneous settings, while introducing heterogeneity (e.g., different models, prompts, or tools) continues to yield substantial gains. This raises a fundamental question: what limits scaling, and why does diversity help? We present an information‑theoretic framework showing that MAS performance is bounded by the intrinsic task uncertainty, not by agent count. We derive architecture‑agnostic bounds demonstrating that improvements depend on how many effective channels the system accesses. Homogeneous agents saturate early because their outputs are strongly correlated, whereas heterogeneous agents contribute complementary evidence. We further introduce K^, an effective channel count that quantifies the number of effective channels without ground‑truth labels. Empirically, we show that heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. Our results provide principled guidelines for building efficient and robust MAS through diversity‑aware design. Code and Dataset are available at the link: https://github.com/SafeRL‑Lab/Agent‑Scaling.

Authors:Duy Nguyen, Hanqi Xiao, Archiki Prasad, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
Title: Conflict-Resolving and Sharpness-Aware Minimization for Generalized Knowledge Editing with Multiple Updates
Abstract:
Large language models (LLMs) rely on internal knowledge to solve many downstream tasks, making it crucial to keep them up to date. Since full retraining is expensive, prior work has explored efficient alternatives such as model editing and parameter‑efficient fine‑tuning. However, these approaches often break down in practice due to poor generalization across inputs, limited stability, and knowledge conflict. To address these limitations, we propose the CoRSA (Conflict‑Resolving and Sharpness‑Aware Minimization) training framework, a parameter‑efficient, holistic approach for knowledge editing with multiple updates. CoRSA tackles multiple challenges simultaneously: it improves generalization to different input forms and enhances stability across multiple updates by minimizing loss curvature, and resolves conflicts by maximizing the margin between new and prior knowledge. Across three widely used fact editing benchmarks, CoRSA achieves significant gains in generalization, outperforming baselines with average absolute improvements of 12.42% over LoRA and 10% over model editing methods. With multiple updates, it maintains high update efficacy while reducing catastrophic forgetting by 27.82% compared to LoRA. CoRSA also generalizes to the code domain, outperforming the strongest baseline by 5.48% Pass@5 in update efficacy.

Authors:Yicheng Zhang, Zhen Qin, Zhaomin Wu, Wenqi Zhang, Shuiguang Deng
Title: Reinforcement Fine-Tuning for History-Aware Dense Retriever in RAG
Abstract:
Retrieval‑augmented generation (RAG) enables large language models (LLMs) to produce evidence‑based responses, and its performance hinges on the matching between the retriever and LLMs. Retriever optimization has emerged as an efficient alternative to fine‑tuning LLMs. However, existing solutions suffer from objective mismatch between retriever optimization and the goal of RAG pipeline. Reinforcement learning (RL) provides a promising solution to address this limitation, yet applying RL to retriever optimization introduces two fundamental challenges: 1) the deterministic retrieval is incompatible with RL formulations, and 2) state aliasing arises from query‑only retrieval in multi‑hop reasoning. To address these challenges, we replace deterministic retrieval with stochastic sampling and formulate RAG as a Markov decision process, making retriever optimizable by RL. Further, we incorporate retrieval history into the state at each retrieval step to mitigate state aliasing. Extensive experiments across diverse RAG pipelines, datasets, and retriever scales demonstrate consistent improvements of our approach in RAG performance.

Authors:Chao Huang, Yujing Lu, Quangang Li, Shenghe Wang, Yan Wang, Yueyang Zhang, Long Xia, Jiashu Zhao, Zhiyuan Sun, Daiting Shi, Tingwen Liu
Title: TRE: Encouraging Exploration in the Trust Region
Abstract:
Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at https://github.com/WhyChaos/TRE‑Encouraging‑Exploration‑in‑the‑Trust‑Region.

Authors:Yaguo Liu, Mingyue Cheng, Daoyu Wang, Xiaoyu Tao, Qi Liu
Title: CoGenCast: A Coupled Autoregressive-Flow Generative Framework for Time Series Forecasting
Abstract:
Time series forecasting can be viewed as a generative problem that requires both semantic understanding over contextual conditions and stochastic modeling of continuous temporal dynamics. Existing approaches typically rely on either autoregressive large language models (LLMs) for semantic context modeling or diffusion‑like models for continuous probabilistic generation. However, neither method alone can adequately model both aspects simultaneously. In this work, we propose CoGenCast, a hybrid generative framework that couples pre‑trained LLMs with flow‑matching mechanism for effective time series forecasting. Specifically, we reconfigure pre‑trained decoder‑only LLMs into a native forecasting encoder‑decoder backbone by modifying only the attention topology, enabling bidirectional context encoding and causal representation generation. Building on this, a flow‑matching mechanism is further integrated to model temporal evolution, capturing continuous stochastic dynamics conditioned on the autoregressively generated representation. Notably, CoGenCast naturally supports multimodal forecasting and cross‑domain unified training. Extensive experiments on multiple benchmarks show that CoGenCast consistently outperforms previous compared baselines. Code is available at https://github.com/liuyaguo/_CoGenCast.

Authors:Maximilian Kleinegger, Elvir Crnčević, Dan Alistarh
Title: MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization
Abstract:
Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer‑quantized model can be served across multiple precisions, by slicing the most significant bits (MSB) at inference time. This enables a single checkpoint to cover a wide range of memory and latency budgets, but renders quantization much more challenging. In particular, the initial MatQuant relies on expensive quantization‑aware training (QAT) variants, rather than fast one‑shot post training quantization (PTQ), and lacks open‑source and kernel support. We address all of these limitations by introducing Post‑Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one‑shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi‑precision objective with bit‑slicing and cross‑bit error compensation, resulting in an algorithm that produces a multi‑bit‑width, "sliceable" model in a single pass. We also incorporate a new budget‑aware search for heterogeneous per‑layer bit‑witdhs and provide efficient kernels that implement slicing and mixed‑precision execution. Across standard LLMs and benchmarks, MatGPTQ preserves high‑bit accuracy while substantially improving performance at low‑bit‑witdh settings. Overall, we establish a new state of the art for Matryoshka‑style post‑training quantization and make single‑checkpoint, multi‑precision deployment open and practical. Code is available at https://github.com/IST‑DASLab/MatGPTQ.

Authors:Yiran Qiao, Jing Chen, Xiang Ao, Qiwei Zhong, Yang Liu, Qing He
Title: Live or Lie: Action-Aware Capsule Multiple Instance Learning for Risk Assessment in Live Streaming Platforms
Abstract:
Live streaming has become a cornerstone of today's internet, enabling massive real‑time social interactions. However, it faces severe risks arising from sparse, coordinated malicious behaviors among multiple participants, which are often concealed within normal activities and challenging to detect timely and accurately. In this work, we provide a pioneering study on risk assessment in live streaming rooms, characterized by weak supervision where only room‑level labels are available. We formulate the task as a Multiple Instance Learning (MIL) problem, treating each room as a bag and defining structured user‑timeslot capsules as instances. These capsules represent subsequences of user actions within specific time windows, encapsulating localized behavioral patterns. Based on this formulation, we propose AC‑MIL, an Action‑aware Capsule MIL framework that models both individual behaviors and group‑level coordination patterns. AC‑MIL captures multi‑granular semantics and behavioral cues through a serial and parallel architecture that jointly encodes temporal dynamics and cross‑user dependencies. These signals are integrated for robust room‑level risk prediction, while also offering interpretable evidence at the behavior segment level. Extensive experiments on large‑scale industrial datasets from Douyin demonstrate that AC‑MIL significantly outperforms MIL and sequential baselines, establishing new state‑of‑the‑art performance in room‑level risk assessment for live streaming. Moreover, AC‑MIL provides capsule‑level interpretability, enabling identification of risky behavior segments as actionable evidence for intervention. The project page is available at: https://qiaoyran.github.io/AC‑MIL/.

Authors:Meng Lou, Yunxiang Fu, Yizhou Yu
Title: Scaling Continual Learning with Bi-Level Routing Mixture-of-Experts
Abstract:
Continual learning, especially class‑incremental learning (CIL), on the basis of a pre‑trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose CaRE, a scalable Continual Learner with efficient Bi‑Level Routing Mixture‑of‑Experts (BR‑MoE). The core idea of BR‑MoE is a bi‑level routing mechanism: a router selection stage that dynamically activates relevant task‑specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. On the other hand, we introduce a challenging evaluation protocol for comprehensively assessing CIL methods across very long task sequences spanning hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5‑20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non‑overlapping tasks), while outperforming all baselines by a large margin on such task sequences. Code will be publicly released at https://github.com/LMMMEng/CaRE.git.

Authors:Hyun Seok Seong, WonJun Moon, Jae-Pil Heo
Title: From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning
Abstract:
Unsupervised object‑centric learning models, particularly slot‑based architectures, have shown great promise in decomposing complex scenes. However, their reliance on reconstruction‑based training creates a fundamental conflict between the sharp, high‑frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder. We identify that this discrepancy gives rise to a vicious cycle: the noisy feature map from the encoder forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from blurry reconstruction maps lacks high‑frequency details necessary to supervise encoder features. To break this cycle, we introduce Synergistic Representation Learning (SRL) that establishes a virtuous cycle where the encoder and decoder mutually refine one another. SRL leverages the encoder's sharpness to deblur the semantic boundary within the decoder output, while exploiting the decoder's spatial consistency to denoise the encoder's features. This mutual refinement process is stabilized by a warm‑up phase with a slot regularization objective that initially allocates distinct entities per slot. By bridging the representational gap between the encoder and decoder, SRL achieves state‑of‑the‑art results on video object‑centric learning benchmarks. Codes are available at https://github.com/hynnsk/SRL.

Authors:Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, Yehui Tang
Title: MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling
Abstract:
Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test‑time computations to boost performance. However, these strategies are impractical for edge device deployment due to limited RAM and NPU resources. Despite hardware constraints, deploying performant LLM on edge devices such as smartphone remains crucial for user experience. To address this, we propose MeKi (Memory‑based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token‑level memory experts that injects pre‑stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re‑parameterization strategy to fold parameter matrices used during training into a compact static lookup table. By offloading the knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines with identical inference speed, validating the effectiveness of memory‑based scaling paradigm for on‑device LLMs. Project homepage is at https://github.com/ningding‑o/MeKi.

Authors:Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, Jun Zhu
Title: RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization
Abstract:
Vision‑Language‑Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero‑shot deployment on novel embodiments for open‑vocabulary tasks. To achieve this, we collected one of the largest open‑source robotic datasets‑‑over 10,000 hours of demonstrations in diverse families‑‑using an enhanced, embodiment‑agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three‑stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow‑matching, and distillation for real‑time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero‑shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state‑of‑the‑art baselines in dexterous, long‑horizon, and dynamic downstream tasks like playing table tennis. See https://rdt‑robotics.github.io/rdt2/ for more information.

Authors:Francesco Di Salvo, Sebastian Doerrich, Jonas Alle, Christian Ledig
Title: HypCBC: Domain-Invariant Hyperbolic Cross-Branch Consistency for Generalizable Medical Image Analysis
Abstract:
Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the advantages of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in‑distribution datasets and three ViT models. We further propose an unsupervised, domain‑invariant hyperbolic cross‑branch consistency constraint. Extensive experiments confirm that our proposed method promotes domain‑invariant features and outperforms state‑of‑the‑art Euclidean methods by an average of +2.1% AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17‑WILDS, and a cross‑dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across substantially different conditions. The code is available at https://github.com/francescodisalvo05/hyperbolic‑cross‑branch‑consistency .

Authors:Wenquan Lu, Hai Huang, Randall Balestriero
Title: Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning
Abstract:
Reinforcement learning algorithms such as group‑relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post‑training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5‑20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low‑entropy regimes without premature collapse. Empirically, a Qwen2.5‑Math‑1.5B model trained with prompt augmentation on the MATH Level 3‑5 dataset achieves state‑of‑the‑art performance, reaching 45.2 per‑benchmark accuracy and 51.8 per‑question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at https://github.com/wenquanlu/prompt‑augmentation‑GRPO.

Authors:Xiaoyu Tao, Mingyue Cheng, Ze Guo, Shuo Yu, Yaguo Liu, Qi Liu, Shijin Wang
Title: MemCast: Memory-Driven Time Series Forecasting with Experience-Conditioned Reasoning
Abstract:
Time series forecasting (TSF) plays a critical role in decision‑making for many real‑world applications. Recently, LLM‑based forecasters have made promising advancements. Despite their effectiveness, existing methods often lack explicit experience accumulation and continual evolution. In this work, we propose MemCast, a learning‑to‑memory framework that reformulates TSF as an experience‑conditioned reasoning task. Specifically, we learn experience from the training set and organize it into a hierarchical memory. This is achieved by summarizing prediction results into historical patterns, distilling inference trajectories into reasoning wisdom, and inducing extracted temporal features into general laws. Furthermore, during inference, we leverage historical patterns to guide the reasoning process and utilize reasoning wisdom to select better trajectories, while general laws serve as criteria for reflective iteration. Additionally, to enable continual evolution, we design a dynamic confidence adaptation strategy that updates the confidence of individual entries without leaking the test set distribution. Extensive experiments on multiple datasets demonstrate that MemCast consistently outperforms previous methods, validating the effectiveness of our approach. Our code is available at https://github.com/Xiaoyu‑Tao/MemCast‑TS.

Authors:Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian
Title: Self-Hinting Language Models Enhance Reinforcement Learning
Abstract:
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self‑hint aligned GRPO with privileged supervision (SAGE), an on‑policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x,h). Crucially, the task reward R(x,τ) is unchanged; hints only increase within‑group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h=\varnothing and deploy the no‑hint policy without any privileged information. Moreover, sampling diverse self‑hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama‑3.2‑3B‑Instruct, +1.2 on Qwen2.5‑7B‑Instruct and +1.3 on Qwen3‑4B‑Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.

Authors:Yinggan Xu, Risto Miikkulainen, Xin Qiu
Title: Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost
Abstract:
Post‑Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory‑constrained devices, yet it renders models static and difficult to fine‑tune. Standard fine‑tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and high‑precision weights to compute gradients. Thus they cannot be used on quantized models, where the parameter space is discrete and non‑differentiable. While Evolution Strategies (ES) offer a backpropagation‑free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradient. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full‑parameter fine‑tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high‑precision gradient signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low‑precision inference levels. QES significantly outperforms the state‑of‑the‑art zeroth‑order fine‑tuning method on arithmetic reasoning tasks, making direct fine‑tuning for quantized models possible. It therefore opens up the possibility for scaling up LLMs entirely in the quantized space. The source code is available at https://github.com/dibbla/Quantized‑Evolution‑Strategies .

Authors:Felix X. -F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, Davis Wertheimer
Title: FlashSinkhorn: IO-Aware Entropic Optimal Transport
Abstract:
Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense n× m interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map‑reduce reduction kernels with limited fusion. We present FlashSinkhorn, an IO‑aware EOT solver for squared Euclidean cost that rewrites stabilized log‑domain Sinkhorn updates as row‑wise LogSumExp reductions of biased dot‑product scores, the same normalization as transformer attention. This enables FlashAttention‑style fusion and tiling: fused Triton kernels stream tiles through on‑chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear‑memory operations. We further provide streaming kernels for transport application, enabling scalable first‑ and second‑order optimization. On A100 GPUs, FlashSinkhorn achieves up to 32× forward‑pass and 161× end‑to‑end speedups over state‑of‑the‑art online baselines on point‑cloud OT, improves scalability on OT‑based downstream tasks. For reproducibility, we release an open‑source implementation at https://github.com/ot‑triton‑lab/ot_triton.

Authors:Evan Wang, Simon Chess, Daniel Lee, Siyuan Ge, Ajit Mallavarapu, Vasily Ilin
Title: Learning to Repair Lean Proofs from Compiler Feedback
Abstract:
As neural theorem provers become increasingly agentic, the ability to interpret and act on compiler feedback is critical. However, existing Lean datasets consist almost exclusively of correct proofs, offering little supervision for understanding and repairing failures. We study Lean proof repair as a supervised learning problem: given an erroneous proof and compiler feedback, predict both a corrected proof and a natural‑language diagnosis grounded in the same feedback. We introduce APRIL (Automated Proof Repair in Lean), a dataset of 260,000 supervised tuples pairing systematically generated proof failures with compiler diagnostics and aligned repair and explanation targets. Training language models on APRIL substantially improves repair accuracy and feedback‑conditioned reasoning; in our single‑shot repair evaluation setting, a finetuned 4B‑parameter model outperforms the strongest open‑source baseline. We view diagnostic‑conditioned supervision as a complementary training signal for feedback‑using provers. Our dataset is available at \hrefhttps://huggingface.co/datasets/uw‑math‑ai/APRILthis link.

Authors:Ran Li, Zeyuan Liu, Yinghao Chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Chen Qian, Zhiyuan Liu, Maosong Sun
Title: CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning
Abstract:
Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high‑quality human‑curated tasks and labels, either through supervised fine‑tuning (SFT) or reinforcement learning (RL) on reasoning‑specific data. This dependence renders supervision‑heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach‑Player paradigm for data‑free reinforcement learning of reasoning models. Unlike traditional adversarial self‑play, CPMöbius, inspired by real world human sports collaboration and multi‑agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5‑Math‑7B‑Instruct, our method improves accuracy by an overall average of +4.9 and an out‑of‑distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R‑zero by +4.2 on OOD accuracy. Our codebase has been released at https://github.com/thunlp/CPMobius.

Authors:Yin Jin, Tucker R. Stewart, Deyi Zhou, Chhavi Gupta, Arjita Nema, Scott C. Brakenridge, Grant E. O'Keefe, Juhua Hu
Title: Rare Event Early Detection: A Dataset of Sepsis Onset for Critically Ill Trauma Patients
Abstract:
Sepsis is a major public health concern due to its high morbidity, mortality, and cost. Its clinical outcome can be substantially improved through early detection and timely intervention. By leveraging publicly available datasets, machine learning (ML) has driven advances in both research and clinical practice. However, existing public datasets consider ICU patients (Intensive Care Unit) as a uniform group and neglect the potential challenges presented by critically ill trauma patients in whom injury‑related inflammation and organ dysfunction can overlap with the clinical features of sepsis. We propose that a targeted identification of post‑traumatic sepsis is necessary in order to develop methods for early detection. Therefore, we introduce a publicly available standardized post‑trauma sepsis onset dataset extracted, relabeled using standardized post‑trauma clinical facts, and validated from MIMIC‑III. Furthermore, we frame early detection of post‑trauma sepsis onset according to clinical workflow in ICUs in a daily basis resulting in a new rare event detection problem. We then establish a general benchmark through comprehensive experiments, which shows the necessity of further advancements using this new dataset. The data code is available at https://github.com/ML4UWHealth/SepsisOnset_TraumaCohort.git.

Authors:Yidong Ouyang, Panwen Hu, Zhengyan Wan, Zhe Wang, Liyan Xie, Dmitriy Bespalov, Ying Nian Wu, Guang Cheng, Hongyuan Zha, Qiang Sun
Title: Training-Free Self-Correction for Multimodal Masked Diffusion Models
Abstract:
Masked diffusion models have emerged as a powerful framework for text and multimodal generation. However, their sampling procedure updates multiple tokens simultaneously and treats generated tokens as immutable, which may lead to error accumulation when early mistakes cannot be revised. In this work, we revisit existing self‑correction methods and identify limitations stemming from additional training requirements or reliance on misaligned likelihood estimates. We propose a training‑free self‑correction framework that exploits the inductive biases of pre‑trained masked diffusion models. Without modifying model parameters or introducing auxiliary evaluators, our method significantly improves generation quality on text‑to‑image generation and multimodal understanding tasks with reduced sampling steps. Moreover, the proposed framework generalizes across different masked diffusion architectures, highlighting its robustness and practical applicability. Code can be found in https://github.com/huge123/FreeCorrection.

Authors:Anthony Fuller, James R. Green, Evan Shelhamer
Title: Self-Soupervision: Cooking Model Soups without Labels
Abstract:
Model soups are strange and strangely effective combinations of parameters. They take a model (the stock), fine‑tune it into multiple models (the ingredients), and then mix their parameters back into one model (the soup) to improve predictions. While all known soups require supervised learning, and optimize the same loss on labeled data, our recipes for Self‑\emphSoupervision generalize soups to self‑supervised learning (SSL). Our Self‑Souping lets us flavor ingredients on new data sources, e.g. from unlabeled data from a task for transfer or from a shift for robustness. We show that Self‑Souping on corrupted test data, then fine‑tuning back on uncorrupted train data, boosts robustness by +3.5% (ImageNet‑C) and +7% (LAION‑C). Self‑\emphSoupervision also unlocks countless SSL algorithms to cook the diverse ingredients needed for more robust soups. We show for the first time that ingredients can differ in their SSL hyperparameters ‑‑ and more surprisingly, in their SSL algorithms. We cook soups of MAE, MoCoV3, and MMCR ingredients that are more accurate than any one single SSL ingredient.

Authors:Ali Abbasi, Chayne Thrash, Haoran Qin, Shansita Sharma, Sepehr Seifi, Soheil Kolouri
Title: Zero Sum SVD: Balancing Loss Sensitivity for Low Rank LLM Compression
Abstract:
Advances in large language models have driven strong performance across many tasks, but their memory and compute costs still hinder deployment. SVD‑based compression reduces storage and can speed up inference via low‑rank factors, yet performance depends on how rank is allocated under a global compression ratio. Prior methods often use homogeneous ranks for similarly sized matrices, despite large differences in loss sensitivity, or rely on expensive iterative pre‑truncation optimization to determine per matrix ranks. We propose Zero Sum SVD (ZS‑SVD), a post‑training method that performs \emphglobal singular component selection using activation whitening and first‑order calibration loss estimates in whitened coordinates. ZS‑SVD prunes components across the whole model with a zero sum rule that keeps the cumulative predicted loss change near zero, automatically yielding heterogeneous ranks without solving a rank allocation optimization. Motivated by evidence that gradients near pretrained solutions exhibit low rank structure, we also introduce an optional lightweight correction that applies a single projected gradient update after truncation, followed by re‑truncation. Extensive experiments across multiple LLM architectures show consistent gains across diverse benchmarks and compression ratios. Code is available at https://github.com/mint‑vu/Zero‑Sum‑SVD

Authors:Michael Ogezi, Martin Bell, Freda Shi, Ethan Smith
Title: From Tokens to Numbers: Continuous Number Modeling for SVG Generation
Abstract:
For certain image generation tasks, vector graphics such as Scalable Vector Graphics (SVGs) offer clear benefits such as increased flexibility, size efficiency, and editing ease, but remain less explored than raster‑based approaches. A core challenge is that the numerical, geometric parameters, which make up a large proportion of SVGs, are inefficiently encoded as long sequences of tokens. This slows training, reduces accuracy, and hurts generalization. To address these problems, we propose Continuous Number Modeling (CNM), an approach that directly models numbers as first‑class, continuous values rather than discrete tokens. This formulation restores the mathematical elegance of the representation by aligning the model's inputs with the data's continuous nature, removing discretization artifacts introduced by token‑based encoding. We then train a multimodal transformer on 2 million raster‑to‑SVG samples, followed by fine‑tuning via reinforcement learning using perceptual feedback to further improve visual quality. Our approach improves training speed by over 30% while maintaining higher perceptual fidelity compared to alternative approaches. This work establishes CNM as a practical and efficient approach for high‑quality vector generation, with potential for broader applications. We make our code available http://github.com/mikeogezi/CNM.

Authors:Matteo Bastico, Pierre Onghena, David Ryckelynck, Beatriz Marcotegui, Santiago Velasco-Forero, Laurent Corté, Caroline Robine--Decourcelle, Etienne Decencière
Title: LmPT: Conditional Point Transformer for Anatomical Landmark Detection on 3D Point Clouds
Abstract:
Accurate identification of anatomical landmarks is crucial for various medical applications. Traditional manual landmarking is time‑consuming and prone to inter‑observer variability, while rule‑based methods are often tailored to specific geometries or limited sets of landmarks. In recent years, anatomical surfaces have been effectively represented as point clouds, which are lightweight structures composed of spatial coordinates. Following this strategy and to overcome the limitations of existing landmarking techniques, we propose Landmark Point Transformer (LmPT), a method for automatic anatomical landmark detection on point clouds that can leverage homologous bones from different species for translational research. The LmPT model incorporates a conditioning mechanism that enables adaptability to different input types to conduct cross‑species learning. We focus the evaluation of our approach on femoral landmarking using both human and newly annotated dog femurs, demonstrating its generalization and effectiveness across species. The code and dog femur dataset will be publicly available at: https://github.com/Pierreoo/LandmarkPointTransformer.

Authors:Reza Rezvan, Gustav Gille, Moritz Schauer, Richard Torkar
Title: Causality--Δ: Jacobian-Based Dependency Analysis in Flow Matching Models
Abstract:
Flow matching learns a velocity field that transports a base distribution to data. We study how small latent perturbations propagate through these flows and show that Jacobian‑vector products (JVPs) provide a practical lens on dependency structure in the generated features. We derive closed‑form expressions for the optimal drift and its Jacobian in Gaussian and mixture‑of‑Gaussian settings, revealing that even globally nonlinear flows admit local affine structure. In low‑dimensional synthetic benchmarks, numerical JVPs recover the analytical Jacobians. In image domains, composing the flow with an attribute classifier yields an attribute‑level JVP estimator that recovers empirical correlations on MNIST and CelebA. Conditioning on small classifier‑Jacobian norms reduces correlations in a way consistent with a hypothesized common‑cause structure, while we emphasize that this conditioning is not a formal do intervention.

Authors:Viresh Pati, Yubin Kim, Vinh Pham, Jevon Twitty, Shihao Yang, Jiecheng Lu
Title: CAPS: Unifying Attention, Recurrence, and Alignment in Transformer-based Time Series Forecasting
Abstract:
This paper presents CAPS (Clock‑weighted Aggregation with Prefix‑products and Softmax), a structured attention mechanism for time series forecasting that decouples three distinct temporal structures: global trends, local shocks, and seasonal patterns. Standard softmax attention entangles these through global normalization, while recent recurrent models sacrifice long‑term, order‑independent selection for order‑dependent causal structure. CAPS combines SO(2) rotations for phase alignment with three additive gating paths ‑‑ Riemann softmax, prefix‑product gates, and a Clock baseline ‑‑ within a single attention layer. We introduce the Clock mechanism, a learned temporal weighting that modulates these paths through a shared notion of temporal importance. Experiments on long‑ and short‑term forecasting benchmarks surpass vanilla softmax and linear attention mechanisms and demonstrate competitive performance against seven strong baselines with linear complexity. Our code implementation is available at https://github.com/vireshpati/CAPS‑Attention.

Authors:Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, Andrea Zanette
Title: Maximum Likelihood Reinforcement Learning
Abstract:
Reinforcement learning is the method of choice to train models in sampling‑based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood, and instead optimizes only a lower‑order approximation. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling‑based framework to approximate maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenges of non‑differentiable sampling by defining a compute‑indexed family of sample‑based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy‑gradient estimator and converge to maximum likelihood optimization in the infinite‑compute limit. Empirically, we show that MaxRL Pareto‑dominates existing methods in all models and tasks we tested, achieving up to 20x test‑time scaling efficiency gains compared to its GRPO‑trained counterpart. We also observe MaxRL to scale better with additional data and compute. Our results suggest MaxRL is a promising framework for scaling RL training in correctness based settings.

Authors:Punya Syon Pandey, Zhijing Jin
Title: BinaryPPO: Efficient Policy Optimization for Binary Classification
Abstract:
Supervised fine‑tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real‑world settings with label noise, class imbalance, or sparse supervision. We introduce BinaryPPO, an offline reinforcement learning large language model (LLM) framework that reformulates binary classification as a reward maximization problem. Our method leverages a variant of Proximal Policy Optimization (PPO) with a confidence‑weighted reward function that penalizes uncertain or incorrect predictions, enabling the model to learn robust decision policies from static datasets without online interaction. Across eight domain‑specific benchmarks and multiple models with differing architectures, BinaryPPO improves accuracy by 40‑60 percentage points, reaching up to 99%, substantially outperforming supervised baselines. We provide an in‑depth analysis of the role of reward shaping, advantage scaling, and policy stability in enabling this improvement. Overall, we demonstrate that confidence‑based reward design provides a robust alternative to SFT for binary classification. Our code is available at https://github.com/psyonp/BinaryPPO.

Authors:Md Ishtyaq Mahmud, Veena Kochat, Suresh Satpati, Jagan Mohan Reddy Dwarampudi, Humaira Anzum, Kunal Rai, Tania Banerjee
Title: hSNMF: Hybrid Spatially Regularized NMF for Image-Derived Spatial Transcriptomics
Abstract:
High‑resolution spatial transcriptomics platforms, such as Xenium, generate single‑cell images that capture both molecular and spatial context, but their extremely high dimensionality poses major challenges for representation learning and clustering. In this study, we analyze data from the Xenium platform, which captures high‑resolution images of tumor microarray (TMA) tissues and converts them into cell‑by‑gene matrices suitable for computational analysis. We benchmark and extend nonnegative matrix factorization (NMF) for spatial transcriptomics by introducing two spatially regularized variants. First, we propose Spatial NMF (SNMF), a lightweight baseline that enforces local spatial smoothness by diffusing each cell's NMF factor vector over its spatial neighborhood. Second, we introduce Hybrid Spatial NMF (hSNMF), which performs spatially regularized NMF followed by Leiden clustering on a hybrid adjacency that integrates spatial proximity (via a contact‑radius graph) and transcriptomic similarity through a tunable mixing parameter alpha. Evaluated on a cholangiocarcinoma dataset, SNMF and hSNMF achieve markedly improved spatial compactness (CHAOS < 0.004, Moran's I > 0.96), greater cluster separability (Silhouette > 0.12, DBI < 1.8), and higher biological coherence (CMC and enrichment) compared to other spatial baselines. Availability and implementation: https://github.com/ishtyaqmahmud/hSNMF

Authors:Xiaoce Wang, Guibin Zhang, Junzhe Li, Jinzhe Tu, Chun Li, Ming Li
Title: ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents
Abstract:
Existing GUI agent models relying on coordinate‑based one‑step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate‑free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi‑step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre‑trained large language model to progressively acquire tool semantics, we construct an easy‑to‑hard curriculum consisting of three tasks: token definition question‑answering, pure text‑guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post‑training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training & inference code is open‑source at https://github.com/ZephinueCode/ToolTok.

Authors:Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang
Title: D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs
Abstract:
Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource‑constrained scenarios. Weight‑only post‑training quantization (PTQ) is appealing, as it reduces memory usage and enables practical speedup without low‑bit operators or specialized hardware. However, accuracy often degrades significantly in weight‑only PTQ at sub‑4‑bit precision, and our analysis identifies two main causes: (1) down‑projection matrices are a well‑known quantization bottleneck, but maintaining their fidelity often requires extra bit‑width; (2) weight quantization induces activation deviations, but effective correction strategies remain underexplored. To address these issues, we propose D^2Quant, a novel weight‑only PTQ framework that improves quantization from both the weight and activation perspectives. On the weight side, we design a Dual‑Scale Quantizer (DSQ) tailored to down‑projection matrices, with an absorbable scaling factor that significantly improves accuracy without increasing the bit budget. On the activation side, we propose Deviation‑Aware Correction (DAC), which incorporates a mean‑shift correction within LayerNorm to mitigate quantization‑induced activation distribution shifts. Extensive experiments across multiple LLM families and evaluation metrics show that D^2Quant delivers superior performance for weight‑only PTQ at sub‑4‑bit precision. The code and models will be available at https://github.com/XIANGLONGYAN/D2Quant.

Authors:Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Yongcheng Jing, Dacheng Tao
Title: SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models
Abstract:
While Diffusion Language Models (DLMs) offer a flexible, arbitrary‑order alternative to the autoregressive paradigm, their non‑causal nature precludes standard KV caching, forcing costly hidden state recomputation at every decoding step. Existing DLM caching approaches reduce this cost by selective hidden state updates; however, they are still limited by (i) costly token‑wise update identification heuristics and (ii) rigid, uniform budget allocation that fails to account for heterogeneous hidden state dynamics. To address these challenges, we present SPA‑Cache that jointly optimizes update identification and budget allocation in DLM cache. First, we derive a low‑dimensional singular proxy that enables the identification of update‑critical tokens in a low‑dimensional subspace, substantially reducing the overhead of update identification. Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality. Together, these contributions significantly improve the efficiency of DLMs, yielding up to an 8× throughput improvement over vanilla decoding and a 2‑‑4× speedup over existing caching baselines.

Authors:Tianle Gu, Kexin Huang, Lingyu Li, Ruilin Luo, Shiyang Huang, Zongqi Wang, Yujiu Yang, Yan Teng, Yingchun Wang
Title: From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation
Abstract:
Safety moderation is pivotal for identifying harmful content. Despite the success of textual safety moderation, its multimodal counterparts remain hindered by a dual sparsity of data and supervision. Conventional reliance on binary labels lead to shortcut learning, which obscures the intrinsic classification boundaries necessary for effective multimodal discrimination. Hence, we propose a novel learning paradigm (UniMod) that transitions from sparse decision‑making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi‑dimensional boundary learning process. This approach forces the model to ground its decision in explicit safety semantics, preventing the model from converging on superficial shortcuts. To facilitate this paradigm, we develop a multi‑head scalar reward model (UniRM). UniRM provides multi‑dimensional supervision by assigning attribute‑level scores to the response generation stage. Furthermore, we introduce specialized optimization strategies to decouple task‑specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi‑task learning. Empirical results show UniMod achieves competitive textual moderation performance and sets a new multimodal benchmark using less than 40% of the training data used by leading baselines. Ablations further validate our multi‑attribute trajectory reasoning, offering an effective and efficient framework for multimodal moderation. Supplementary materials are available at \hrefhttps://trustworthylab.github.io/UniMod/project website.

Authors:Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Christina Lioma
Title: Measuring Individual User Fairness with User Similarity and Effectiveness Disparity
Abstract:
Individual user fairness is commonly understood as treating similar users similarly. In Recommender Systems (RSs), several evaluation measures exist for quantifying individual user fairness. These measures evaluate fairness via either: (i) the disparity in RS effectiveness scores regardless of user similarity, or (ii) the disparity in items recommended to similar users regardless of item relevance. Both disparity in recommendation effectiveness and user similarity are very important in fairness, yet no existing individual user fairness measure simultaneously accounts for both. In brief, current user fairness evaluation measures implement a largely incomplete definition of fairness. To fill this gap, we present Pairwise User unFairness (PUF), a novel evaluation measure of individual user fairness that considers both effectiveness disparity and user similarity. PUF is the only measure that can express this important distinction. We empirically validate that PUF does this consistently across 4 datasets and 7 rankers, and robustly when varying user similarity or effectiveness. In contrast, all other measures are either almost insensitive to effectiveness disparity or completely insensitive to user similarity. We contribute the first RS evaluation measure to reliably capture both user similarity and effectiveness in individual user fairness. Our code: https://github.com/theresiavr/PUF‑individual‑user‑fairness‑recsys.

Authors:Chen Hu, Qianxi Zhao, Yuming Li, Mingyu Zhou, Xiyin Li
Title: UNSO: Unified Newton Schulz Orthogonalization
Abstract:
The Newton‑Schulz (NS) iteration has gained increasing interest for its role in the Muon optimizer and the Stiefel manifold. However, the conventional NS iteration suffers from inefficiency and instability. Although various improvements have been introduced to NS iteration, they fail to deviate from the conventional iterative paradigm, which could increase computation burden largely due to the matrix products along the long dimension repeatedly. To address this, we consolidate the iterative structure into a unified framework, named Unified Newton‑Schulz Orthogonalization (UNSO). To do so, we could avoid a polynomial expansion. Instead, we evaluate the role of each matrix power, remove the insignificant terms, and provide a recommended polynomial with learnable coefficients. These learnable coefficients are then optimized, and achieve an outstanding performance with stable convergence. The code of our method is available: https://github.com/greekinRoma/Unified_Newton_Schulz_Orthogonalization.

Authors:Dulhan Jayalath, Oiwi Parker Jones
Title: MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training
Abstract:
Clinical brain‑to‑text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre‑training improves data‑efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre‑train with only a few seconds of context. Thus, we propose MEG‑XL, a model pre‑trained with 2.5 minutes of MEG context per sample, 5‑300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine‑tuning on the task of word decoding from brain data, MEG‑XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre‑trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long‑context pre‑training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at https://github.com/neural‑processing‑lab/MEG‑XL .

Authors:Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang
Title: RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System
Abstract:
We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed‑loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step‑wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory‑motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3‑VL‑8B‑Thinking by 9.1% on OSWorld and Qwen2.5‑7B‑Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward‑model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen‑Verse/Open‑AgentRL

Authors:Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang
Title: MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Abstract:
Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand‑designed operations for extracting memory. These fixed procedures hard‑code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emphcontroller that learns to select a small set of relevant skills, paired with an LLM‑based \emphexecutor that produces skill‑guided memories. Beyond learning skill selection, MemSkill introduces a \emphdesigner that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed‑loop procedure that improves both the skill‑selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self‑evolving memory management for LLM agents.

Authors:Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang
Title: Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Abstract:
Methods for controlling large language models (LLMs), including local weight fine‑tuning, LoRA‑based adaptation, and activation‑based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference‑utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task‑valid generation, and measures both on a shared log‑odds scale using polarity‑paired contrastive examples. Across methods, we observe a consistent trade‑off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target‑concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid‑generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.

Authors:Min Cai, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Xi Ye, Daiting Shi
Title: Advancing General-Purpose Reasoning Models with Modular Gradient Surgery
Abstract:
Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open‑ended reasoning. However, training a single general‑purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross‑domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi‑task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi‑domain RL and presents an effective solution for training general‑purpose LRMs.

Authors:Zheng Li, Jerry Cheng, Huanying Gu
Title: AROpt: An Optimization Method for Autoregressive Time Series Forecasting
Abstract:
Current time‑series forecasting models are primarily based on transformer‑style neural networks. These models achieve long‑term forecasting mainly by scaling up the model size rather than through genuinely autoregressive (AR) rollout. From the perspective of large language model training, traditional time‑series forecasting model training ignores the monotonic error‑growth heuristic. In this paper, we propose a novel training method for time‑series forecasting that enforces two key properties: (1) AR prediction errors should increase with the forecasting horizon. Violations of this trend are interpreted as rollout inconsistency and are softly penalized during training, and (2) the method enables models to be able to concatenate short‑term AR predictions to form flexible long‑term forecasts. Empirical results demonstrate that our method establishes a new state‑of‑the‑art across multiple benchmarks, achieving an MSE reduction of more than 10% compared to iTransformer and other recent strong baselines. Furthermore, it enables short‑horizon forecasting models to perform reliable long‑term predictions at horizons over 7.5 times longer. Code is available at https://github.com/LizhengMathAi/AROpt

Authors:Roman Dyachenko, Nikita Gushchin, Kirill Sokolov, Petr Mokrov, Evgeny Burnaev, Alexander Korotin
Title: Variational Entropic Optimal Transport
Abstract:
Entropic optimal transport (EOT) in continuous spaces with quadratic cost is a classical tool for solving the domain translation problem. In practice, recent approaches optimize a weak dual EOT objective depending on a single potential, but doing so is computationally not efficient due to the intractable log‑partition term. Existing methods typically resolve this obstacle in one of two ways: by significantly restricting the transport family to obtain closed‑form normalization (via Gaussian‑mixture parameterizations), or by using general neural parameterizations that require simulation‑based training procedures. We propose Variational Entropic Optimal Transport (VarEOT), based on an exact variational reformulation of the log‑partition \log \mathbbE[\exp(\cdot)] as a tractable minimization over an auxiliary log‑normalizer. This yields a differentiable learning objective optimized with stochastic gradients and avoids the necessity of MCMC simulations during the training. We provide theoretical guarantees, including finite‑sample generalization bounds and approximation results under universal function approximation. Experiments on synthetic data and unpaired image‑to‑image translation demonstrate competitive or improved translation quality, while comparisons within the solvers that use the same weak dual EOT objective support the benefit of the proposed optimization principle. The code for our solver can be found at https://github.com/DrEternity/VarEOT .

Authors:Tong Yang, Yemin Wang, Chaoning Zhang, Aming Wu
Title: Fat-Cat: Document-Driven Metacognitive Multi-Agent System for Complex Reasoning
Abstract:
The effectiveness of LLM‑based agents is often limited not by model capacity alone, but by how efficiently contextual information is utilized at runtime. Existing agent frameworks rely on rigid, syntax‑heavy state representations such as nested JSON, which require models to devote a substantial portion of their limited attention to syntactic processing rather than semantic reasoning. In this paper, we propose Fat‑Cat, a document‑driven agent architecture that improves the signal‑to‑noise ratio of state management. By integrating three key components: (1) a Semantic File System that represents agent state as Markdown documents aligned with common pre‑training corpora, (2) a Textual Strategy Evolution module that accumulates task‑solving knowledge without parameter updates, and (3) a Closed‑Loop Watcher that monitors reasoning trajectories to reduce hallucinations. Extensive reasoning, retrieval, and coding benchmarks, Fat‑Cat consistently improves agent performance. It enables the Kimi‑k2 model to outperform the proprietary GPT‑4o baseline on HotPotQA. Replacing the document‑based state with JSON leads to performance drop, while empirically validating the critical necessity of document‑driven state modeling over rigid syntax. The code is available at https://github.com/answeryt/Fat‑Cat.

Authors:Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao
Title: Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision‑DeepResearch systems that use search engines for complex visual‑textual fact‑finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search‑centric: answers that should require visual search are often leaked through cross‑textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, overly idealized evaluation scenario: On the image‑search side, the required information can often be obtained via near‑exact matching against the full image, while the text‑search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision‑DeepResearch benchmark (VDR‑Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi‑stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision‑DeepResearch systems under realistic real‑world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi‑round cropped‑search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep‑research systems. The code will be released in https://github.com/Osilly/Vision‑DeepResearch.

Authors:Pawel Batorski, Paul Swoboda
Title: EvoMU: Evolutionary Machine Unlearning
Abstract:
Machine unlearning aims to unlearn specified training data (e.g. sensitive or copyrighted material). A prominent approach is to fine‑tune an existing model with an unlearning loss that retains overall utility. The space of suitable unlearning loss functions is vast, making the search for an optimal loss function daunting. Additionally, there might not even exist a universally optimal loss function: differences in the structure and overlap of the forget and retain data can cause a loss to work well in one setting but over‑unlearn or under‑unlearn in another. Our approach EvoMU tackles these two challenges simultaneously. An evolutionary search procedure automatically finds task‑specific losses in the vast space of possible unlearning loss functions. This allows us to find dataset‑specific losses that match or outperform existing losses from the literature, without the need for a human‑in‑the‑loop. This work is therefore an instance of automatic scientific discovery, a.k.a. an AI co‑scientist. In contrast to previous AI co‑scientist works, we do so on a budget: We achieve SotA results using a small 4B parameter model (Qwen3‑4B‑Thinking), showing the potential of AI co‑scientists with limited computational resources. Our experimental evaluation shows that we surpass previous loss‑based unlearning formulations on TOFU‑5%, TOFU‑10%, MUSE and WMDP by synthesizing novel unlearning losses. Our code is available at https://github.com/Batorskq/EvoMU.

Authors:Nima Shoghi, Yuxuan Liu, Yuning Shen, Rob Brekelmans, Pan Li, Quanquan Gu
Title: Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics
Abstract:
Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long‑horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio‑temporal dynamics. We present STAR‑MD (Spatio‑Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)‑equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio‑temporal attention that efficiently captures complex space‑time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR‑MD achieves state‑of‑the‑art performance across all metrics‑‑substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR‑MD successfully extrapolates to generate stable microsecond‑scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long‑horizon generation, while demonstrating that STAR‑MD's joint spatio‑temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.

Authors:Liyan Xu, Mo Yu, Fandong Meng, Jie Zhou
Title: How Far Ahead Do LLMs Plan? Uncovering the Latent Horizon in Chain-of-Thought Reasoning
Abstract:
Chain‑of‑thought (CoT) reasoning has become a central mechanism for eliciting multi‑step reasoning in Large Language Models (LLMs). Yet recent evidence presents a tension: hidden states appear to already encode future reasoning before CoT fully unfolds, while explicit steps still remain crucial for tasks requiring compositional computation. To deepen the understanding between LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs, through our probing method, Tele‑Lens, applying to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, which we validate that a sparse set of pivot positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele‑lens.

Authors:Yoonjun Cho, Dongjae Jeon, Soeun Kim, Moongyu Jeon, Albert No
Title: Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs
Abstract:
Quantization Error Reconstruction (QER) reduces accuracy loss in Post‑Training Quantization (PTQ) by approximating weights as \mathbfW \approx \mathbfQ + \mathbfL\mathbfR, using a rank‑r correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when \mathbfW has intrinsic low‑rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank‑allocation framework that preserves the top‑k singular subspace of the activation‑scaled weight before quantization, quantizes only the residual, and uses the remaining rank r‑k for error reconstruction. We derive a theory‑guided criterion for selecting k by balancing quantization‑exposed energy and unrecoverable error under rank constraints. We further show that resulting \mathbfQ + \mathbfL\mathbfR parameterization naturally supports Quantized Parameter‑Efficient Fine‑Tuning (QPEFT), and stabilizes fine‑tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage‑point average gain on GLUE under 2‑bit QPEFT. The project page is available at https://ai‑isl.github.io/srr.

Authors:Zhen-Hao Xie, Jun-Tao Tang, Yu-Cheng Shi, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou
Title: SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning
Abstract:
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real‑world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding‑related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture‑of‑Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task‑relevant directions. To mitigate expert drift, we regulate expert updates via curvature‑aware scaling using historical input covariance in a rehearsal‑free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross‑task interference. We also introduce a new benchmark to evaluate MCIT with long task sequence, and extensive experiments demonstrate SAME's SOTA performance. Code is available at https://github.com/LAMDA‑CL/Prism.

Authors:Hongwei Yan, Guanglong Sun, Kanglei Zhou, Qian Li, Liyuan Wang, Yi Zhong
Title: FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning
Abstract:
General continual learning (GCL) challenges intelligent systems to learn from single‑pass, non‑stationary data streams without clear task boundaries. While recent advances in continual parameter‑efficient tuning (PET) of pretrained models show promise, they typically rely on multiple training epochs and explicit task cues, limiting their effectiveness in GCL scenarios. Moreover, existing methods often lack targeted design and fail to address two fundamental challenges in continual PET: how to allocate expert parameters to evolving data distributions, and how to improve their representational capacity under limited supervision. Inspired by the fruit fly's hierarchical memory system characterized by sparse expansion and modular ensembles, we propose FlyPrompt, a brain‑inspired framework that decomposes GCL into two subproblems: expert routing and expert competence improvement. FlyPrompt introduces a randomly expanded analytic router for instance‑level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time. Extensive theoretical and empirical evaluations demonstrate FlyPrompt's superior performance, achieving up to 11.23%, 12.43%, and 7.62% gains over state‑of‑the‑art baselines on CIFAR‑100, ImageNet‑R, and CUB‑200, respectively. Our source code is available at https://github.com/AnAppleCore/FlyGCL.

Authors:Zishuo Lan, Junjie Li, Lei Wang, Jincheng Wang
Title: FluxNet: Learning Capacity-Constrained Local Transport Operators for Conservative and Bounded PDE Surrogates
Abstract:
Autoregressive learning of time‑stepping operators provides an effective approach to data‑driven partial differential equation (PDE) simulation, yet for conservation laws, they face a fundamental challenge: learned updates may violate global conservation over long rollouts. For the important subclass of mass‑conservation‑type equations, the problem is compounded by inherent physical bounds (e.g., nonnegativity or concentrations in [0,1]) whose violation further destabilizes predictions. We introduce FluxNet, which learns cumulative transport amounts representing the total conserved quantity redistributed between each cell and a configurable neighborhood over the full surrogate interval. A conservative update guarantees exact discrete conservation by construction; modular capacity‑constrained transport heads (L, U, and D) enforce lower bounds, upper bounds, or near‑zero dual‑bound violations through architectural design. Unlike flux‑rate surrogates that require temporal integration and thus inherit CFL constraints, FluxNet involves no such integration; configurable transport neighborhoods enable large‑timestep prediction at full spatial resolution. Ghost cells extend the framework to non‑periodic boundaries. Experiments on four benchmarks (1D convection‑‑diffusion, 2D shallow water, 1D traffic flow, 2D Cahn‑‑Hilliard) demonstrate exact conservation, structural bound preservation, architecture modularity, and superior stability over flux‑rate surrogates at large temporal strides. The code is publicly available at: https://github.com/Lan‑zs/FluxNet.

Authors:Wenbo Pan, Zhichao Liu, Xianlong Wang, Haining Yu, Xiaohua Jia
Title: Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs
Abstract:
Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face two critical challenges: (1) efficiency bottleneck, where attributing a target span of M tokens within a context of length N requires O(MN) operations, making long‑context attribution prohibitively slow; and (2) faithfulness drop, where intermediate reasoning tokens absorb attribution mass, preventing importance from propagating back to the original input. To address these, we introduce FlashTrace, an efficient multi‑token attribution method that employs span‑wise aggregation to compute attribution over multi‑token targets in a single pass, while maintaining faithfulness. Moreover, we design a recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs. Extensive experiments on long‑context retrieval (RULER) and multi‑step reasoning (MATH, MorehopQA) tasks demonstrate that FlashTrace achieves over 130x speedup over existing baselines while maintaining superior faithfulness. We further analyze the dynamics of recursive attribution, showing that even a single recursive hop improves faithfulness by tracing importance through the reasoning chain.

Authors:Jinbin Bai, Yixuan Li, Yuchen Zhu, Yi Xin, Qingyu Shi, Aosong Feng, Xiaohong Liu, Molei Tao, Jianru Xue, Xiangtai Li, Ming-Hsuan Yang
Title: Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
Abstract:
Inference‑time compute has re‑emerged as a practical way to improve LLM reasoning. Most test‑time scaling (TTS) algorithms rely on autoregressive decoding, which is ill‑suited to discrete diffusion language models (dLLMs) due to their parallel decoding over the entire sequence. As a result, developing effective and efficient TTS methods to unlock dLLMs' full generative potential remains an underexplored challenge. To address this, we propose Prism (Pruning, Remasking, and Integrated Self‑verification Method), an efficient TTS framework for dLLMs that (i) performs Hierarchical Trajectory Search (HTS) which dynamically prunes and reallocates compute in an early‑to‑mid denoising window, (ii) introduces Local branching with partial remasking to explore diverse implementations while preserving high‑confidence tokens, and (iii) replaces external verifiers with Self‑Verified Feedback (SVF) obtained via self‑evaluation prompts on intermediate completions. Across four mathematical reasoning and code generation benchmarks on three dLLMs, including LLaDA 8B Instruct, Dream 7B Instruct, and LLaDA 2.0‑mini, our Prism achieves a favorable performance‑efficiency trade‑off, matching best‑of‑N performance with substantially fewer function evaluations (NFE). The code is released at https://github.com/viiika/Prism.

Authors:Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng
Title: CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
Abstract:
The quadratic complexity and indefinitely growing key‑value (KV) cache of standard Transformers pose a major barrier to long‑context processing. To overcome this, we introduce the Collaborative Memory Transformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug‑in module, CoMeT can be integrated into pre‑trained models with only minimal fine‑tuning. It operates on sequential data chunks, using a dual‑memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long‑range dependencies. These memories then act as a dynamic soft prompt for the next chunk. To enable efficient fine‑tuning on extremely long contexts, we introduce a novel layer‑level pipeline parallelism strategy. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine‑tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full‑attention baseline on summarization tasks. Its practical effectiveness is further validated on real‑world agent and user behavior QA tasks. The code is available at: https://github.com/LivingFutureLab/Comet

Authors:Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan
Title: Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models
Abstract:
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post‑training. However, we show that modern reasoning post‑training induces an unintended exploration collapse: temperature‑based sampling no longer increases pass@n accuracy. Empirically, the final‑layer posterior of post‑trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth‑conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://github.com/AlbertTan404/Latent‑Exploration‑Decoding.

Authors:Hayeong Lee, JunHyeok Oh, Byung-Jun Lee
Title: TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning
Abstract:
The design of environments plays a critical role in shaping the development and evaluation of cooperative multi‑agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high‑throughput sandbox designed for reconfigurable multi‑agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade‑offs across a diverse spectrum of task complexities. Leveraging JAX for hardware‑accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku‑dmlab/TABX.

Authors:Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah
Title: SUSD: Structured Unsupervised Skill Discovery through State Factorization
Abstract:
Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI‑based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task‑relevant behaviors. Distance‑Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state‑space distances, yet still fall short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine‑grained control on the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine‑grained and disentangled control over individual entities which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: https://github.com/hadi‑hosseini/SUSD.

Authors:Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le
Title: Spectral Text Fusion: A Frequency-Aware Approach to Multimodal Time-Series Forecasting
Abstract:
Multimodal time series forecasting is crucial in real‑world applications, where decisions depend on both numerical data and contextual signals. The core challenge is to effectively combine temporal numerical patterns with the context embedded in other modalities, such as text. While most existing methods align textual features with time‑series patterns one step at a time, they neglect the multiscale temporal influences of contextual information such as time‑series cycles and dynamic shifts. This mismatch between local alignment and global textual context can be addressed by spectral decomposition, which separates time series into frequency components capturing both short‑term changes and long‑term trends. In this paper, we propose SpecTF, a simple yet effective framework that integrates the effect of textual data on time series in the frequency domain. Our method extracts textual embeddings, projects them into the frequency domain, and fuses them with the time series' spectral components using a lightweight cross‑attention mechanism. This adaptively reweights frequency bands based on textual relevance before mapping the results back to the temporal domain for predictions. Experimental results demonstrate that SpecTF significantly outperforms state‑of‑the‑art models across diverse multi‑modal time series datasets while utilizing considerably fewer parameters. Code is available at https://github.com/hiepnh137/SpecTF.

Authors:Quang Truong, Yu Song, Donald Loveland, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang
Title: Plain Transformers are Surprisingly Powerful Link Predictors
Abstract:
Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. While Graph Neural Networks (GNNs) are the standard solution, state‑of‑the‑art pipelines often rely on explicit structural heuristics or memory‑intensive node embeddings ‑‑ approaches that struggle to generalize or scale to massive graphs. Emerging Graph Transformers (GTs) offer a potential alternative but often incur significant overhead due to complex structural encodings, hindering their applications to large‑scale link prediction. We challenge these sophisticated paradigms with PENCIL, an encoder‑only plain Transformer that replaces hand‑crafted priors with attention over sampled local subgraphs, retaining the scalability and hardware efficiency of standard Transformers. Through experimental and theoretical analysis, we show that PENCIL extracts richer structural signals than GNNs, implicitly generalizing a broad class of heuristics and subgraph‑based expressivity. Empirically, PENCIL outperforms heuristic‑informed GNNs and is far more parameter‑efficient than ID‑embedding‑‑based alternatives, while remaining competitive across diverse benchmarks ‑‑ even without node features. Our results challenge the prevailing reliance on complex engineering techniques, demonstrating that simple design choices are potentially sufficient to achieve the same capabilities. Our code is publicly available at https://github.com/quang‑truong/pencil.

Authors:Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
Title: MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety
Abstract:
Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre‑collected data distributions. In this paper, we introduce MAGIC, a novel multi‑turn multi‑agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co‑evolution, where the attacker's ever‑changing strategies continuously uncover long‑tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.

Authors:Muheng Li, Jian Qian, Wenlong Mou
Title: Predicting and improving test-time scaling laws via reward tail-guided search
Abstract:
Test‑time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models (LLMs). Though the straight‑forward ''best‑of‑N'' (BoN) strategy has already demonstrated significant improvements in performance, it lacks principled guidance on the choice of N, budget allocation, and multi‑stage decision‑making, thereby leaving substantial room for optimization. While many works have explored such optimization, rigorous theoretical guarantees remain limited. In this work, we propose new methodologies to predict and improve scaling properties via tail‑guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations. Leveraging this prediction tool, we introduce Scaling‑Law Guided (SLG) Search, a new test‑time algorithm that dynamically allocates compute to identify and exploit intermediate states with the highest predicted potential. We theoretically prove that SLG achieves vanishing regret compared to perfect‑information oracles, and achieves expected rewards that would otherwise require a polynomially larger compute budget required when using BoN. Empirically, we validate our framework across different LLMs and reward models, confirming that tail‑guided allocation consistently achieves higher reward yields than Best‑of‑N under identical compute budgets. Our code is available at https://github.com/PotatoJnny/Scaling‑Law‑Guided‑search.

Authors:Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis
Title: Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas
Abstract:
We propose Parabolic Position Encoding (PaPE), a parabola‑based position encoding for vision modalities in attention‑based architectures. Given a set of vision tokens‑such as images, point clouds, videos, or event camera streams‑our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D‑sequences in language to nD‑structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE‑RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE‑RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet‑1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next‑best position encoding. Code is available at https://github.com/DTU‑PAS/parabolic‑position‑encoding.

Authors:Yochai Yemini, Yoav Ellinson, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya
Title: SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling
Abstract:
This paper addresses the challenge of audio‑visual single‑microphone speech separation and enhancement in the presence of real‑world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in \acWER across all conditions. We further extend our framework to handle off‑screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/

Authors:Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park
Title: PromptRL: Prompt Matters in RL for Flow-Based Image Generation
Abstract:
Flow matching models (FMs) have revolutionized text‑to‑image (T2I) generation, with reinforcement learning (RL) serving as a critical post‑training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow‑Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow‑based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state‑of‑the‑art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large‑scale image editing models, improving the EditReward of FLUX.1‑Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine‑grained data annotations along with a complex multi‑stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2× fewer rollouts compared to naive flow‑only RL. Our code is available at https://github.com/G‑U‑N/UniRL.

Authors:Kangjun Noh, Seongchan Lee, Ilmun Kim, Kyungwoo Song
Title: Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses
Abstract:
Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high‑stakes domains such as medicine and law. Conformal inference provides distribution‑free guarantees, but existing approaches are either overly conservative, discarding many true‑claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim‑level scores. Our method, Multi‑LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality‑scores, which in our experiments led to higher retention, while validity is preserved through group‑conditional calibration. Experiments show that MACI consistently achieves user‑specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI‑Yonsei/MACI

Authors:Jiayu Bai, Danchen Yu, Zhenyu Liao, TianQi Hou, Feng Zhou, Robert C. Qiu, Zenan Ling
Title: Diving into Kronecker Adapters: Component Design Matters
Abstract:
Kronecker adapters have emerged as a promising approach for fine‑tuning large‑scale models, enabling high‑rank updates through tunable component structures. However, existing work largely treats the component structure as a fixed or heuristic design choice, leaving the dimensions and number of Kronecker components underexplored. In this paper, we identify component structure as a key factor governing the capacity of Kronecker adapters. We perform a fine‑grained analysis of both the dimensions and number of Kronecker components. In particular, we show that the alignment between Kronecker adapters and full fine‑tuning depends on component configurations. Guided by these insights, we propose Component Designed Kronecker Adapters (CDKA). We further provide parameter‑budget‑aware configuration guidelines and a tailored training stabilization strategy for practical deployment. Experiments across various natural language processing tasks demonstrate the effectiveness of CDKA. Code is available at https://github.com/rainstonee/CDKA.

Authors:Marco Chen, Xianbiao Qi, Yelin He, Jiaquan Ye, Rong Xiao
Title: SimpleGPT: Improving GPT via A Simple Normalization Strategy
Abstract:
In this work, we revisit Transformer optimization through the lens of second‑order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm‑based network, tolerates learning rates 3×‑10× larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well‑established baselines. Specifically, when training 7B‑scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at https://github.com/Ocram7/SimpleGPT.

Authors:Mengsha Kou, Xiaoyu Xia, Ziqi Wang, Ibrahim Khalil, Runkun Luo, Jingwen Zhou, Minhui Xue
Title: WinFLoRA: Incentivizing Client-Adaptive Aggregation in Federated LoRA under Privacy Heterogeneity
Abstract:
Large Language Models (LLMs) increasingly underpin intelligent web applications, from chatbots to search and recommendation, where efficient specialization is essential. Low‑Rank Adaptation (LoRA) enables such adaptation with minimal overhead, while federated LoRA allows web service providers to fine‑tune shared models without data sharing. However, in privacy‑sensitive deployments, clients inject varying levels of differential privacy (DP) noise, creating privacy heterogeneity that misaligns individual incentives and global performance. In this paper, we propose WinFLoRA, a privacy‑heterogeneous federated LoRA that utilizes aggregation weights as incentives with noise awareness. Specifically, the noises from clients are estimated based on the uploaded LoRA adapters. A larger weight indicates greater influence on the global model and better downstream task performance, rewarding lower‑noise contributions. By up‑weighting low‑noise updates, WinFLoRA improves global accuracy while accommodating clients' heterogeneous privacy requirements. Consequently, WinFLoRA aligns heterogeneous client utility in terms of privacy and downstream performance with global model objectives without third‑party involvement. Extensive evaluations demonstrate that across multiple LLMs and datasets, WinFLoRA achieves up to 52.58% higher global accuracy and up to 2.56x client utility than state‑of‑the‑art benchmarks. Source code is publicly available at https://github.com/koums24/WinFLoRA.git.

Authors:Xin Nie, Haicheng Zhang, Liang Dong, Beining Feng, Jinhong Weng, Guiling Sun
Title: SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models
Abstract:
Mixed‑precision quantization is a promising approach for compressing large language models under tight memory budgets. However, existing mixed‑precision methods typically suffer from one of two limitations: they either rely on expensive discrete optimization to determine precision allocation, or introduce hardware inefficiencies due to irregular memory layouts. We propose SFMP, a search‑free and hardware‑friendly mixed‑precision quantization framework for large language models. The framework is built upon four novel ideas: Fractional bit‑width, which extends integer bit‑width for weight matrix to fractional value and transforms discrete precision allocation as a continuous problem; 2)Block‑wise mixed‑precision, enabling fine‑grained precision within weight matrices while remaining hardware‑friendly; 3)Row‑column weight reordering, which aggregates salient weights via row and column reordering, incurring only a small activation reordering overhead during inference; 4)Unified GEMM kernel, which supports mixed‑precision GEMM at arbitrary average bit‑width. Extensive experiments demonstrate that SFMP outperforms state‑of‑the‑art layer‑wise mixed‑precision methods under the same memory constraints, while significantly reducing quantization cost and improving inference efficiency. Code is available at https://github.com/Nkniexin/SFMP

Authors:Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang
Title: Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
Abstract:
Vision‑language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image‑based jailbreaks crafted to induce harmful responses. Existing gradient‑based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white‑box surrogate and fail to generalise to black‑box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic‑based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision‑level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our \hrefhttps://github.com/kaiyuanCui/UltraBreakGitHub repository.

Authors:Yuheng Yang, Siqi Zhu, Tao Feng, Ge Liu, Jiaxuan You
Title: Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction
Abstract:
Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundaries extend. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularities. To ensure the quality of extracted knowledge, we introduce a three‑stage knowledge processing pipeline that combines vector‑based filtering to remove exact duplicates, LLM‑based adjudication to resolve ambiguous semantic overlaps, and domain‑relevance auditing to retain valid knowledge units. Through extensive experiments, we find that recursive taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently extract more knowledge. In addition, we identify a Pass@1‑versus‑Pass@k trade‑off: domain‑specialized models achieve higher initial accuracy but degrade rapidly, while general‑purpose models maintain stable performance during extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families.

Authors:Víctor Yeste, Paolo Rosso
Title: Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? When Hard Gating Hurts
Abstract:
Sentence‑level human value detection is typically framed as multi‑label classification over Schwartz values, but it remains unclear whether Schwartz higher‑order (HO) categories provide usable structure. We study this under a strict compute‑frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO\rightarrowvalues pipelines that enforce the hierarchy with hard masks, and (iii) Presence\rightarrowHO\rightarrowvalues cascades, alongside low‑cost add‑ons (lexica, short context, topics), label‑wise threshold tuning, small instruction‑tuned LLM baselines (\le10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro‑F_1\approx0.58), but hard hierarchical gating is not a reliable win: it often reduces end‑task Macro‑F_1 via error compounding and recall suppression. In contrast, label‑wise threshold tuning is a high‑leverage knob (up to +0.05 Macro‑F_1), and small transformer ensembles provide the most consistent additional gains (up to +0.02 Macro‑F_1). Small LLMs lag behind supervised encoders as stand‑alone systems, yet can contribute complementary errors in cross‑family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence‑level value detection; robust improvements come from calibration and lightweight ensembling.

Authors:Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang
Title: EffGen: Enabling Small Language Models as Capable Autonomous Agents
Abstract:
Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open‑source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool‑calling with prompt optimization that compresses contexts by 70‑80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity‑based routing using five factors to make smart pre‑execution decisions, and (4) Unified memory system combining short‑term, long‑term, and vector‑based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross‑protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (https://effgen.org/) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at https://github.com/ctrl‑gaurav/effGen.

Authors:Yuhao Huang, Taos Transue, Shih-Hsin Wang, William Feldman, Hong Zhang, Bao Wang
Title: Improving Flow Matching by Aligning Flow Divergence
Abstract:
Conditional flow matching (CFM) stands out as an efficient, simulation‑free approach for training flow‑based generative models, achieving remarkable performance for data generation. However, CFM is insufficient to ensure accuracy in learning probability paths. In this paper, we introduce a new partial differential equation characterization for the error between the learned and exact probability paths, along with its solution. We show that the total variation gap between the two probability paths is bounded above by a combination of the CFM loss and an associated divergence loss. This theoretical insight leads to the design of a new objective function that simultaneously matches the flow and its divergence. Our new approach improves the performance of the flow‑based generative model by a noticeable margin without sacrificing generation efficiency. We showcase the advantages of this enhanced training approach over CFM on several important benchmark tasks, including generative modeling for dynamical systems, DNA sequences, and videos. Code is available at \hrefhttps://github.com/Utah‑Math‑Data‑Science/Flow_Div_MatchingUtah‑Math‑Data‑Science.

Authors:Yakun Wang, Leyang Wang, Song Liu, Taiji Suzuki
Title: Zero-Flow Encoders
Abstract:
Flow‑based methods have achieved significant success in various generative modeling tasks, capturing nuanced details within complex data distributions. However, few existing works have exploited this unique capability to resolve fine‑grained structural details beyond generation tasks. This paper presents a flow‑inspired framework for representation learning. First, we demonstrate that a rectified flow trained using independent coupling is zero everywhere at t=0.5 if and only if the source and target distributions are identical. We term this property the \emphzero‑flow criterion. Second, we show that this criterion can certify conditional independence, thereby extracting \emphsufficient information from the data. Third, we translate this criterion into a tractable, simulation‑free loss function that enables learning amortized Markov blankets in graphical models and latent representations in self‑supervised learning tasks. Experiments on both simulated and real‑world datasets demonstrate the effectiveness of our approach. The code reproducing our experiments can be found at: https://github.com/probabilityFLOW/zfe.

Authors:Hao Gu, Mao-Lin Luo, Zi-Hao Zhou, Han-Chen Zhang, Min-Ling Zhang, Tong Wei
Title: Spectral Imbalance Causes Forgetting in Low-Rank Continual Adaptation
Abstract:
Parameter‑efficient continual learning aims to adapt pre‑trained models to sequential tasks without forgetting previously acquired knowledge. Most existing approaches treat continual learning as avoiding interference with past updates, rather than considering what properties make the current task‑specific update naturally preserve previously acquired knowledge. From a knowledge‑decomposition perspective, we observe that low‑rank adaptations exhibit highly imbalanced singular value spectra: a few dominant components absorb most of the adaptation energy, thereby (i) more likely to disrupt previously acquired knowledge and (ii) making the update more vulnerable to interference from subsequent tasks. To enable explicit balance among components, we decouple the magnitude of the task update from its directional structure and formulate it as a constrained optimization problem on a restricted Stiefel manifold. We address this problem using a projected first‑order method compatible with standard deep‑learning optimizers used in vision‑language models. Our method mitigates both backward and forward forgetting, consistently outperforming continual learning baselines. The implementation code is available at https://github.com/haodotgu/EBLoRA.

Authors:Kezhao Lai, Yutao Lai, Hai-Lin Liu
Title: Beyond the Node: Clade-level Selection for Efficient MCTS in Automatic Heuristic Design
Abstract:
While Monte Carlo Tree Search (MCTS) shows promise in Large Language Model (LLM) based Automatic Heuristic Design (AHD), it suffers from a critical over‑exploitation tendency under the limited computational budgets required for heuristic evaluation. To address this limitation, we propose Clade‑AHD, an efficient framework that replaces node‑level point estimates with clade‑level Bayesian beliefs. By aggregating descendant evaluations into Beta distributions and performing Thompson Sampling over these beliefs, Clade‑AHD explicitly models uncertainty to guide exploration, enabling more reliable decision‑making under sparse and noisy evaluations. Extensive experiments on complex combinatorial optimization problems demonstrate that Clade‑AHD consistently outperforms state‑of‑the‑art methods while significantly reducing computational cost. The source code is publicly available at: https://github.com/Mriya0306/Clade‑AHD.

Authors:Xinmo Jin, Bowen Fan, Xunkai Li, Henan Sun, YuXin Zeng, Zekai Chen, Yuxuan Sun, Jia Li, Qiangqiang Dai, Hongchao Qin, Rong-Hua Li, Guoren Wang
Title: OpenDDI: A Comprehensive Benchmark for DDI Prediction
Abstract:
Drug‑Drug Interactions (DDIs) significantly influence therapeutic efficacy and patient safety. As experimental discovery is resource‑intensive and time‑consuming, efficient computational methodologies have become essential. The predominant paradigm formulates DDI prediction as a drug graph‑based link prediction task. However, further progress is hindered by two fundamental challenges: (1) lack of high‑quality data: most studies rely on small‑scale DDI datasets and single‑modal drug representations; (2) lack of standardized evaluation: inconsistent scenarios, varied metrics, and diverse baselines. To address the above issues, we propose OpenDDI, a comprehensive benchmark for DDI prediction. Specifically, (1) from the data perspective, OpenDDI unifies 6 widely used DDI datasets and 2 existing forms of drug representation, while additionally contributing 3 new large‑scale LLM‑augmented datasets and a new multimodal drug representation covering 5 modalities. (2) From the evaluation perspective, OpenDDI unifies 20 SOTA model baselines across 3 downstream tasks, with standardized protocols for data quality, effectiveness, generalization, robustness, and efficiency. Based on OpenDDI, we conduct a comprehensive evaluation and derive 10 valuable insights for DDI prediction while exposing current limitations to provide critical guidance for this rapidly evolving field. Our code is available at https://github.com/xiaoriwuguang/OpenDDI

Authors:Apurba Prasad Padhy, Fernando Camacho, Saibal Mukhopadhyay
Title: AIRE-Prune: Asymptotic Impulse-Response Energy for State Pruning in State Space Models
Abstract:
State space models (SSMs) often sacrifice capacity, search space, or stability to offset the memory and compute costs of large state dimensions. We introduce a structured post‑training pruning method for SSMs ‑‑ AIRE‑Prune (Asymptotic Impulse‑Response Energy for State PRUN(E)) ‑‑ that reduces each layer's state dimension by directly minimizing long‑run output‑energy distortion. AIRE‑Prune assigns every state a closed‑form asymptotic impulse‑response energy‑based score, i.e., the total impulse‑response energy it contributes over an infinite horizon (time), and normalizes these scores layer‑wise to enable global cross‑layer comparison and selection. This extends modal truncation from single systems to deep stacks and aligns pruning with asymptotic response energy rather than worst‑case gain. Across diverse sequence benchmarks, AIRE‑Prune reveals substantial redundancy in SISO and MIMO SSMs with average pruning of 60.8%, with average accuracy drop of 0.29% without retraining, while significantly lowering compute. Code: https://github.com/falcon‑arrow/AIRE‑Prune.

Authors:Hengchang Liu, Zhao Yang, Bing Su
Title: Diffusion LMs Can Approximate Optimal Infilling Lengths Implicitly
Abstract:
Diffusion language models (DLMs) provide a bidirectional generation framework naturally suited for infilling, yet their performance is constrained by the pre‑specified infilling length. In this paper, we reveal that DLMs possess an inherent ability to discover the correct infilling length. We identify two key statistical phenomena in the first‑step denoising confidence: a local Oracle Peak that emerges near the ground‑truth length and a systematic Length Bias that often obscures this signal. By leveraging this signal and calibrating the bias, our training‑free method CAL (Calibrated Adaptive Length) enables DLMs to approximate the optimal length through an efficient search before formal decoding. Empirical evaluations demonstrate that CAL improves Pass@1 by up to 47.7% over fixed‑length baselines and 40.5% over chat‑based adaptive methods in code infilling, while boosting BLEU‑2 and ROUGE‑L by up to 8.5% and 9.9% in text infilling. These results demonstrate that CAL paves the way for robust DLM infilling without requiring any specialized training. Code is available at https://github.com/NiuHechang/Calibrated_Adaptive_Length.

Authors:Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Ziyan Weng, Yingwei Zhang
Title: PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Multimodal Agents
Abstract:
As multimodal agents evolve from passive observers to long‑horizon decision‑makers, they require memory systems that provide not just information availability but logical verifiability. A fundamental limitation of current architectures is the epistemic asymmetry inherent in probabilistic vision‑language models and dense associative memories: they conflate semantic affinity with factual existence and structurally fail to encode negative constraints. To this end, we introduce PolarMem, a training‑free Polarized Latent Graph Memory designed to ground agent reasoning in verifiable evidence. PolarMem transforms fuzzy perceptual likelihoods into discrete logical constraints through non‑parametric distributional partitioning. Furthermore, it employs a polarized graph topology with orthogonal inhibitory connections to explicitly store verified negation as a primary cognitive state. At inference time, we enforce a logic‑dominant retrieval paradigm, suppressing hallucinatory patterns that violate negative constraints. Extensive evaluation across eight frozen Vision‑‑Language Models and six benchmarks demonstrates that PolarMem functions as a robust cognitive system, establishing a foundation for verifiable multimodal agents. Our code is available at https://github.com/czs‑ict/PolarMem.

Authors:Gabriel Bromonschenkel, Alessandro L. Koerich, Thiago M. Paixão, Hilário Tomaz Alves de Oliveira
Title: Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset
Abstract:
Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English‑based models, low‑resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross‑native‑translated evaluation of Transformer‑based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross‑context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP‑Score metric to evaluate the image‑description alignment. Our findings show that Swin‑DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre‑trained VLM, surpasses larger multilingual models (GPT‑4o, LLaMa 3.2 Vision) in traditional text‑based evaluation metrics, while GPT‑4 models achieve the highest CLIP‑Score, highlighting improved image‑text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available in: https://github.com/laicsiifes/transformer‑caption‑ptbr.

Authors:Jie Yang, Yifan Hu, Yuante Li, Kexin Zhang, Kaize Ding, Philip S. Yu
Title: From Observations to States: Latent Time Series Forecasting
Abstract:
Deep learning has achieved strong performance in Time Series Forecasting (TSF). However, we identify a critical representation paradox, termed Latent Chaos: models with accurate predictions often learn latent representations that are temporally disordered and lack continuity. We attribute this phenomenon to the dominant observation‑space forecasting paradigm. Most TSF models minimize point‑wise errors on noisy and partially observed data, which encourages shortcut solutions instead of the recovery of underlying system dynamics. To address this issue, we propose Latent Time Series Forecasting (LatentTSF), a novel paradigm that shifts TSF from observation regression to latent state prediction. Specifically, LatentTSF employs an AutoEncoder to project observations at each time step into a higher‑dimensional latent state space. This expanded representation aims to capture underlying system variables and impose a smoother temporal structure. Forecasting is then performed entirely in the latent space, allowing the model to focus on learning structured temporal dynamics. Theoretical analysis demonstrates that our proposed latent objectives implicitly maximize mutual information between predicted latent states and ground‑truth states and observations. Extensive experiments on widely‑used benchmarks confirm that LatentTSF effectively mitigates latent chaos, achieving superior performance. Our code is available in https://github.com/Muyiiiii/LatentTSF.

Authors:Franz A. Heinsen, Leo Kozachkov
Title: Self-Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation
Abstract:
The most widely used artificial intelligence (AI) models today are Transformers employing self‑attention. In its standard form, self‑attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society's ability to provide them. To help address this issue, we show that self‑attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders‑of‑magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation's Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed‑forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial‑kernel feature basis. Notably, cost is fixed inversely in proportion to head size, enabling application over a greater number of heads per token than otherwise feasible. We implement our formulation and empirically validate its correctness. Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large‑scale Transformer models. The mathematical techniques we introduce are of independent interest.

Authors:Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
Title: VoxServe: Streaming-Centric Serving System for Speech Language Models
Abstract:
Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model‑execution abstraction that decouples model architecture from system‑level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming‑aware scheduling and an asynchronous inference pipeline to improve end‑to‑end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10‑20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox‑serve/vox‑serve.

Authors:Tianyi Hu, Niket Tandon, Akhil Arora
Title: DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking
Abstract:
Existing retrieval‑augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information‑seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug‑and‑play agentic RAG framework with novel reflection‑guided generation and memory‑augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity‑quality trade‑off in open‑ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity‑quality trade‑off compared to competitive baselines and previous state‑of‑the‑art methods on the real‑world Infinity‑Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM‑based systems for open‑ended information‑seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: https://github.com/au‑clan/Diverge

Authors:Beier Zhu, Kesen Zhao, Jiequan Cui, Qianru Sun, Yuan Zhou, Xun Yang, Hanwang Zhang
Title: Reducing Class-Wise Performance Disparity via Margin Regularization
Abstract:
Deep neural networks often exhibit substantial disparities in class‑wise accuracy, even when trained on class‑balanced data, posing concerns for reliable deployment. While prior efforts have explored empirical remedies, a theoretical understanding of such performance disparities in classification remains limited. In this work, we present Margin Regularization for Performance Disparity Reduction (MR^2), a theoretically principled regularization for classification by dynamically adjusting margins in both the logit and representation spaces. Our analysis establishes a margin‑based, class‑sensitive generalization bound that reveals how per‑class feature variability contributes to error, motivating the use of larger margins for hard classes. Guided by this insight, MR^2 optimizes per‑class logit margins proportional to feature spread and penalizes excessive representation margins to enhance intra‑class compactness. Experiments on seven datasets, including ImageNet, and diverse pre‑trained backbones (MAE, MoCov2, CLIP) demonstrate that MR^2 not only improves overall accuracy but also significantly boosts hard class performance without trading off easy classes, thus reducing performance disparity. Code is available at: https://github.com/BeierZhu/MR2

Authors:Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren
Title: GEPC: Group-Equivariant Posterior Consistency for Out-of-Distribution Detection in Diffusion Models
Abstract:
Diffusion models learn a time‑indexed score field \mathbfs_θ(\mathbfx_t,t) that often inherits approximate equivariances (flips, rotations, circular shifts) from in‑distribution (ID) data and convolutional backbones. Most diffusion‑based out‑of‑distribution (OOD) detectors exploit score magnitude or local geometry (energies, curvature, covariance spectra) and largely ignore equivariances. We introduce Group‑Equivariant Posterior Consistency (GEPC), a training‑free probe that measures how consistently the learned score transforms under a finite group \mathcalG, detecting equivariance breaking even when score magnitude remains unchanged. At the population level, we propose the ideal GEPC residual, which averages an equivariance‑residual functional over \mathcalG, and we derive ID upper bounds and OOD lower bounds under mild assumptions. GEPC requires only score evaluations and produces interpretable equivariance‑breaking maps. On OOD image benchmark datasets, we show that GEPC achieves competitive or improved AUROC compared to recent diffusion‑based baselines while remaining computationally lightweight. On high‑resolution synthetic aperture radar imagery where OOD corresponds to targets or anomalies in clutter, GEPC yields strong target‑background separation and visually interpretable equivariance‑breaking maps. Code is available at https://github.com/RouzAY/gepc‑diffusion/.

Authors:Soumyadip Sarkar
Title: MiniTensor: A Lightweight, High-Performance Tensor Operations Library
Abstract:
We present MiniTensor, an open source tensor operations library that focuses on minimalism, correctness, and performance. MiniTensor exposes a familiar PyTorch‑like Python API while it executes performance critical code in a Rust engine. The core supports dense n dimensional tensors, broadcasting, reductions, matrix multiplication, reverse mode automatic differentiation, a compact set of neural network layers, and standard optimizers. In this paper, we describe the design of MiniTensor's architecture, including its efficient memory management, dynamic computation graph for gradients, and integration with Python via PyO3. We also compare the install footprint with PyTorch and TensorFlow to demonstrate that MiniTensor achieves a package size of only a few megabytes, several orders of magnitude smaller than mainstream frameworks, while preserving the essentials needed for research and development on CPUs. The repository can be found at https://github.com/neuralsorcerer/minitensor

Authors:Ming-Yao Ho, Cheng-Kai Wang, You-Teng Lin, Hung-Hsuan Chen
Title: SCPL: Enhancing Neural Network Training Throughput with Decoupled Local Losses and Model Parallelism
Abstract:
Adopting large‑scale AI models in enterprise information systems is often hindered by high training costs and long development cycles, posing a significant managerial challenge. The standard end‑to‑end backpropagation (BP) algorithm is a primary driver of modern AI, but it is also the source of inefficiency in training deep networks. This paper introduces a new training methodology, Supervised Contrastive Parallel Learning (SCPL), that addresses this issue by decoupling BP and transforming a long gradient flow into multiple short ones. This design enables the simultaneous computation of parameter gradients in different layers, achieving superior model parallelism and enhancing training throughput. Detailed experiments are presented to demonstrate the efficiency and effectiveness of our model compared to BP, Early Exit, GPipe, and Associated Learning (AL), a state‑of‑the‑art method for decoupling backpropagation. By mitigating a fundamental performance bottleneck, SCPL provides a practical pathway for organizations to develop and deploy advanced information systems more cost‑effectively and with greater agility. The experimental code is released for reproducibility. https://github.com/minyaho/scpl/

Authors:Yue Yu, Ting Bai, HengZhi Lan, Li Qian, Li Peng, Jie Wu, Wei Liu, Jian Luan, Chuan Shi
Title: C$^2$-Cite: Contextual-Aware Citation Generation for Attributed Large Language Models
Abstract:
The attribution technique enhances the credibility of LLMs by adding citations to the generated sentences, enabling users to trace back to the original sources and verify the reliability of the output. However, existing instruction‑tuned attributed LLMs often fail to properly interpret the contextual semantics of citation symbols (e.g., [i]) during text generation. This shortcoming arises from their insufficient awareness of the context information surrounding citation markers, which in turn leads to disjointed references and poor integration of retrieved knowledge into the generated content. To address this issue, we propose a novel Contextual‑aware Citation generation framework (C^2‑Cite) that explicitly integrates the semantic relationships between citation markers and their referenced content. Specifically, a contextual citation alignment mechanism is adopted: it first encodes the retrieved document contexts into the symbol representation of citations, then aligns the marker numbers by decoding information from a citation router function. This mechanism enables the transformation of citation markers from generic placeholders into active knowledge pointers that link to the referenced source information. Experimental results on the ALCE benchmark across three datasets validate our framework C^2‑Cite++: it outperforms the SOTA baseline by an average of 5.8% in citation quality and 17.4% in response correctness. The implementation is publicly available at https://github.com/BAI‑LAB/c2cite

Authors:Yu Zheng, Chen Gao, Jianxin Chang, Yanan Niu, Yang Song, Depeng Jin, Meng Wang, Yong Li
Title: Disentangled Interest Network for Out-of-Distribution CTR Prediction
Abstract:
Click‑through rate (CTR) prediction, which estimates the probability of a user clicking on a given item, is a critical task for online information services. Existing approaches often make strong assumptions that training and test data come from the same distribution. However, the data distribution varies since user interests are constantly evolving, resulting in the out‑of‑distribution (OOD) issue. In addition, users tend to have multiple interests, some of which evolve faster than others. Towards this end, we propose Disentangled Click‑Through Rate prediction (DiseCTR), which introduces a causal perspective of recommendation and disentangles multiple aspects of user interests to alleviate the OOD issue in recommendation. We conduct a causal factorization of CTR prediction involving user interest, exposure model, and click model, based on which we develop a deep learning implementation for these three causal mechanisms. Specifically, we first design an interest encoder with sparse attention which maps raw features to user interests, and then introduce a weakly supervised interest disentangler to learn independent interest embeddings, which are further integrated by an attentive interest aggregator for prediction. Experimental results on three real‑world datasets show that DiseCTR achieves the best accuracy and robustness in OOD recommendation against state‑of‑the‑art approaches, significantly improving AUC and GAUC by over 0.02 and reducing logloss by over 13.7%. Further analyses demonstrate that DiseCTR successfully disentangles user interests, which is the key to OOD generalization for CTR prediction. We have released the code and data at https://github.com/DavyMorgan/DiseCTR/.

Authors:Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini
Title: FOCUS: DLLMs Know How to Tame Their Compute Bound
Abstract:
Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto‑Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non‑decodable tokens. We further observe a strong correlation between attention‑derived token importance and token‑wise decoding probability. Based on this insight, we propose FOCUS ‑‑ an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non‑decodable ones on‑the‑fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52× throughput improvement over the production‑grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands‑lab/FOCUS.

Authors:Luca Della Libera, Cem Subakan, Mirco Ravanelli
Title: Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
Abstract:
Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character‑Aligned Speech Tokenizer that enables variable‑frame‑rate tokenization through soft character‑level alignment and explicit duration modeling. DyCAST learns to associate tokens with character‑level linguistic units during training and supports alignment‑free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval‑augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed‑frame‑rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.

Authors:Eugenia Iofinova, Dan Alistarh
Title: Behemoth: Benchmarking Unlearning in LLMs Using Fully Synthetic Data
Abstract:
As artificial neural networks, and specifically large language models, have improved rapidly in capabilities and quality, they have increasingly been deployed in real‑world applications, from customer service to Google search, despite the fact that they frequently make factually incorrect or undesirable statements. This trend has inspired practical and academic interest in model editing, that is, in adjusting the weights of the model to modify its likely outputs for queries relating to a specific fact or set of facts. This may be done either to amend a fact or set of facts, for instance, to fix a frequent error in the training data, or to suppress a fact or set of facts entirely, for instance, in case of dangerous knowledge. Multiple methods have been proposed to do such edits. However, at the same time, it has been shown that such model editing can be brittle and incomplete. Moreover the effectiveness of any model editing method necessarily depends on the data on which the model is trained, and, therefore, a good understanding of the interaction of the training data distribution and the way it is stored in the network is necessary and helpful to reliably perform model editing. However, working with large language models trained on real‑world data does not allow us to understand this relationship or fully measure the effects of model editing. We therefore propose Behemoth, a fully synthetic data generation framework. To demonstrate the practical insights from the framework, we explore model editing in the context of simple tabular data, demonstrating surprising findings that, in some cases, echo real‑world results, for instance, that in some cases restricting the update rank results in a more effective update. The code is available at https://github.com/IST‑DASLab/behemoth.git.

Authors:Christiaan P. Opperman, Anna S. Bosman, Katherine M. Malan
Title: Regularisation in neural networks: a survey and empirical analysis of approaches
Abstract:
Despite huge successes on a wide range of tasks, neural networks are known to sometimes struggle to generalise to unseen data. Many approaches have been proposed over the years to promote the generalisation ability of neural networks, collectively known as regularisation techniques. These are used as common practice under the assumption that any regularisation added to the pipeline would result in a performance improvement. In this study, we investigate whether this assumption holds in practice. First, we provide a broad review of regularisation techniques, including modern theories such as double descent. We propose a taxonomy of methods under four broad categories, namely: (1) data‑based strategies, (2) architecture strategies, (3) training strategies, and (4) loss function strategies. Notably, we highlight the contradictions and correspondences between the approaches in these broad classes. Further, we perform an empirical comparison of the various regularisation techniques on classification tasks for ten numerical and image datasets applied to the multi‑layer perceptron and convolutional neural network architectures. Results show that the efficacy of regularisation is dataset‑dependent. For example, the use of a regularisation term only improved performance on numeric datasets, whereas batch normalisation improved performance on image datasets only. Generalisation is crucial to machine learning; thus, understanding the effects of applying regularisation techniques, and considering the connections between them is essential to the appropriate use of these methods in practice.

Authors:Santanu Subhash Rathod, Pietro Liò, Xiao Zhang
Title: SplineFlow: Flow Matching for Dynamical Systems with B-Spline Interpolants
Abstract:
Flow matching is a scalable generative framework for characterizing continuous normalizing flows with wide‑range applications. However, current state‑of‑the‑art methods are not well‑suited for modeling dynamical systems, as they construct conditional paths using linear interpolants that may not capture the underlying state evolution, especially when learning higher‑order dynamics from irregular sampled observations. Constructing unified paths that satisfy multi‑marginal constraints across observations is challenging, since naïve higher‑order polynomials tend to be unstable and oscillatory. We introduce SplineFlow, a theoretically grounded flow matching algorithm that jointly models conditional paths across observations via B‑spline interpolation. Specifically, SplineFlow exploits the smoothness and stability of B‑spline bases to learn the complex underlying dynamics in a structured manner while ensuring the multi‑marginal requirements are met. Comprehensive experiments across various deterministic and stochastic dynamical systems of varying complexity, as well as on cellular trajectory inference tasks, demonstrate the strong improvement of SplineFlow over existing baselines. Our code is available at: https://github.com/santanurathod/SplineFlow.

Authors:Seyedeh Ava Razi Razavi, James Sargant, Sheridan Houghten, Renata Dividino
Title: Adaptive Edge Learning for Density-Aware Graph Generation
Abstract:
Generating realistic graph‑structured data is challenging due to discrete structures, variable sizes, and class‑specific connectivity patterns that resist conventional generative modelling. While recent graph generation methods employ generative adversarial network (GAN) frameworks to handle permutation invariance and irregular topologies, they typically rely on random edge sampling with fixed probabilities, limiting their capacity to capture complex structural dependencies between nodes. We propose a density‑aware conditional graph generation framework using Wasserstein GANs (WGAN) that replaces random sampling with a learnable distance‑based edge predictor. Our approach embeds nodes into a latent space where proximity correlates with edge likelihood, enabling the generator to learn meaningful connectivity patterns. A differentiable edge predictor determines pairwise relationships directly from node embeddings, while a density‑aware selection mechanism adaptively controls edge density to match class‑specific sparsity distributions observed in real graphs. We train the model using a WGAN with gradient penalty, employing a GCN‑based critic to ensure generated graphs exhibit realistic topology and align with target class distributions. Experiments on benchmark datasets demonstrate that our method produces graphs with superior structural coherence and class‑consistent connectivity compared to existing baselines. The learned edge predictor captures complex relational patterns beyond simple heuristics, generating graphs whose density and topology closely match real structural distributions. Our results show improved training stability and controllable synthesis, making the framework effective for realistic graph generation and data augmentation. Source code is publicly available at https://github.com/ava‑12/Density_Aware_WGAN.git.

Authors:Arvind Mahankali, Kaiyue Wen, Tengyu Ma
Title: Divide-and-Conquer CoT: RL for Reducing Latency via Parallel Reasoning
Abstract:
Long chain‑of‑thought reasoning (Long CoT) is now fundamental to state‑of‑the‑art LLMs, especially in mathematical reasoning. However, LLM generation is highly sequential, and long CoTs lead to a high latency. We propose to train Divide‑and‑Conquer CoT (DC‑CoT) to reduce the latency. With DC‑CoT, the model can act as a director that identifies distinct subtasks that can be performed in parallel in its reasoning process, and then spawns workers to execute the subtasks. Our goal is to achieve high accuracy, with a low longest path length, which is a theoretical measure of the latency needed for the response. We start with a long CoT base model (DeepScaleR‑1.5B‑Preview), and first use SFT with a small curated demonstration set to initialize its ability to spawn workers in a certain format. Because SFT degrades the accuracy significantly, we design a multi‑stage RL algorithm, with various data filtering strategies, to recover the accuracy while decreasing the longest path length. Across several benchmarks including AIME 2024 and HMMT 2025, DC‑CoT achieves similar accuracy as DeepScaleR‑1.5B‑Preview while decreasing longest path length by 35‑40%. Our code, SFT dataset and models are publicly available at https://github.com/amahankali10/DC_CoT_RL_for_Low_Latency_CoT_with_Parallel_Reasoning.

Authors:Muqing Liu, Chongjie Si, Yuheng Jia
Title: FlexLoRA: Entropy-Guided Flexible Low-Rank Adaptation
Abstract:
Large pre‑trained models achieve remarkable success across diverse domains, yet fully fine‑tuning incurs prohibitive computational and memory costs. Parameter‑efficient fine‑tuning (PEFT) has thus become a mainstream paradigm. Among them, Low‑Rank Adaptation (LoRA) introduces trainable low‑rank matrices and shows strong performance, nevertheless, its fixed‑rank design limits flexibility. Dynamic rank allocation methods mitigate this issue by pruning redundant directions; however, they often rely on heuristic, element‑level metrics that globally sort rank directions without matrix‑wise distinction, and they lack mechanisms to expand capacity in layers requiring additional adaptation. To overcome these limitations, we propose FlexLoRA, an entropy‑guided flexible low‑rank adaptation framework that (i) evaluates matrix importance via spectral energy entropy, (ii) supports rank pruning and expansion under a global budget, and (iii) employs zero‑impact initialization for newly added singular directions to ensure stability. By addressing granularity, flexibility, and stability limitations, FlexLoRA provides a more principled solution for PEFT. Extensive experiments show that FlexLoRA consistently outperforms state‑of‑the‑art baselines across benchmarks. Codes are available at https://github.com/Chongjie‑Si/Subspace‑Tuning.

Authors:Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier
Title: Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
Abstract:
We study offline reinforcement learning of style‑conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it into a practical framework. Building on this, we introduce Style‑Conditioned Implicit Q‑Learning (SCIQL), which leverages offline goal‑conditioned RL techniques, such as hindsight relabeling and value learning, and combine it with a new Gated Advantage Weighted Regression mechanism to efficiently optimize task performance while preserving style alignment. Experiments demonstrate that SCIQL achieves superior performance on both objectives compared to prior offline methods. Code, datasets and visuals are available in: https://sciql‑iclr‑2026.github.io/.

Authors:Andrei Panferov, Erik Schultheis, Soroush Tabesh, Dan Alistarh
Title: Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation
Abstract:
The NVFP4 lower‑precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end‑to‑end fully‑quantized pre‑training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro‑scaled formats, called MS‑EDEN, that has more than 2x lower quantization error than SR. We integrate it into a novel fully‑NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, both on the forward and on the backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end‑to‑end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at https://github.com/IST‑DASLab/Quartet‑II .

Authors:Dong Xu, Qihua Pan, Sisi Yuan, Jianqiang Li, Zexuan Zhu, Junkai Ji
Title: Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation
Abstract:
Molecular generative models, often employing GPT‑style language modeling on molecular string representations, have shown promising capabilities when scaled to large datasets and model sizes. However, it remains unclear and subject to debate whether these models adhere to predictable scaling laws under fixed computational budgets, which is a crucial understanding for optimally allocating resources between model size, data volume, and molecular representation. In this study, we systematically investigate the scaling behavior of molecular language models across both pretraining and downstream tasks. We train 300 models and conduct over 10,000 experiments, rigorously controlling compute budgets while independently varying model size, number of training tokens, and molecular representation. Our results demonstrate clear scaling laws in molecular models for both pretraining and downstream transfer, reveal the substantial impact of molecular representation on performance, and explain previously observed inconsistencies in scaling behavior for molecular generation. Additionally, we publicly release the largest library of molecular language models to date to facilitate future research and development. Code and models are available at https://github.com/SZU‑ADDG/MLM‑Scaling.

Authors:Ritesh Bhadana
Title: Deep Learning-Based Early-Stage IR-Drop Estimation via CNN Surrogate Modeling
Abstract:
IR‑drop is a critical power integrity challenge in modern VLSI designs that can cause timing degradation, reliability issues, and functional failures if not detected early in the design flow. Conventional IR‑drop analysis relies on physics‑based signoff tools, which provide high accuracy but incur significant computational cost and require near‑final layout information, making them unsuitable for rapid early‑stage design exploration. In this work, we propose a deep learning‑based surrogate modeling approach for early‑stage IR‑drop estimation using a CNN. The task is formulated as a dense pixel‑wise regression problem, where spatial physical layout features are mapped directly to IR‑drop heatmaps. A U‑Net‑based encoder‑decoder architecture with skip connections is employed to effectively capture both local and global spatial dependencies within the layout. The model is trained on a physics‑inspired synthetic dataset generated by us, which incorporates key physical factors including power grid structure, cell density distribution, and switching activity. Model performance is evaluated using standard regression metrics such as Mean Squared Error (MSE) and Peak Signal‑to‑Noise Ratio (PSNR). Experimental results demonstrate that the proposed approach can accurately predict IR‑drop distributions with millisecond‑level inference time, enabling fast pre‑signoff screening and iterative design optimization. The proposed framework is intended as a complementary early‑stage analysis tool, providing designers with rapid IR‑drop insight prior to expensive signoff analysis. The implementation, dataset generation scripts, and the interactive inference application are publicly available at: https://github.com/riteshbhadana/IR‑Drop‑Predictor. The live application can be accessed at: https://ir‑drop‑predictor.streamlit.app/.

Authors:Mengfan Liu, Da Zheng, Junwei Su, Chuan Wu
Title: Full-Graph vs. Mini-Batch Training: Comprehensive Analysis from a Batch Size and Fan-Out Size Perspective
Abstract:
Full‑graph and mini‑batch Graph Neural Network (GNN) training approaches have distinct system design demands, making it crucial to choose the appropriate approach to develop. A core challenge in comparing these two GNN training approaches lies in characterizing their model performance (i.e., convergence and generalization) and computational efficiency. While a batch size has been an effective lens in analyzing such behaviors in deep neural networks (DNNs), GNNs extend this lens by introducing a fan‑out size, as full‑graph training can be viewed as mini‑batch training with the largest possible batch size and fan‑out size. However, the impact of the batch and fan‑out size for GNNs remains insufficiently explored. To this end, this paper systematically compares full‑graph vs. mini‑batch training of GNNs through empirical and theoretical analyses from the view points of the batch size and fan‑out size. Our key contributions include: 1) We provide a novel generalization analysis using the Wasserstein distance to study the impact of the graph structure, especially the fan‑out size. 2) We uncover the non‑isotropic effects of the batch size and the fan‑out size in GNN convergence and generalization, providing practical guidance for tuning these hyperparameters under resource constraints. Finally, full‑graph training does not always yield better model performance or computational efficiency than well‑tuned smaller mini‑batch settings. The implementation can be found in the github link: https://github.com/LIUMENGFAN‑gif/GNN_fullgraph_minibatch_training.

Authors:Abhishek Tyagi, Yunuo Cen, Shrey Dhorajiya, Bharadwaj Veeravalli, Xuanyao Fong
Title: DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning
Abstract:
Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed‑Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset‑specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons in LLMs during autoregressive generation as the context evolves. To address this, we introduce DART, i.e., Dynamic Attention‑Guided Runtime Tracing), a lightweight, training‑free method that performs on‑the‑fly context‑based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron‑level masks to retain salient parameters. Across ten benchmarks, DART outperforms prior dynamic baseline, achieving accuracy gains of up to 14.5% on LLAMA‑3.1‑8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x better ROUGE‑L scores with respect to static‑masked pruning on summarization tasks, with its performance comparable to the original dense models. We conclusively demonstrate that the proposed framework effectively adapts to diverse semantic contexts, preserves model capabilities across both general and domain‑specific tasks while running at less than 10MBs of memory for LLAMA‑3.1‑8B(16GBs) with 0.1% FLOPs overhead. The code is available at https://github.com/seeder‑research/DART.

Authors:En Fu, Yanyan Hu, Changhua Hu, Zengwang Jin, Kaixiang Peng
Title: PEFT-MuTS: A Multivariate Parameter-Efficient Fine-Tuning Framework for Remaining Useful Life Prediction based on Cross-domain Time Series Representation Model
Abstract:
The application of data‑driven remaining useful life (RUL) prediction has long been constrained by the availability of large amount of degradation data. Mainstream solutions such as domain adaptation and meta‑learning still rely on large amounts of historical degradation data from equipment that is identical or similar to the target, which imposes significant limitations in practical applications. This study investigates PEFT‑MuTS, a Parameter‑Efficient Fine‑Tuning framework for few‑shot RUL prediction, built on cross‑domain pre‑trained time‑series representation models. Contrary to the widely held view that knowledge transfer in RUL prediction can only occur within similar devices, we demonstrate that substantial benefits can be achieved through pre‑training process with large‑scale cross‑domain time series datasets. A independent feature tuning network and a meta‑variable‑based low rank multivariate fusion mechanism are developed to enable the pre‑trained univariate time‑series representation backbone model to fully exploit the multivariate relationships in degradation data for downstream RUL prediction task. Additionally, we introduce a zero‑initialized regressor that stabilizes the fine‑tuning process under few‑shot conditions. Experiments on aero‑engine and industrial bearing datasets demonstrate that our method can achieve effective RUL prediction even when less than 1% of samples of target equipment are used. Meanwhile, it substantially outperforms conventional supervised and few‑shot approaches while markedly reducing the data required to achieve high predictive accuracy. Our code is available at https://github.com/fuen1590/PEFT‑MuTS.

Authors:Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang, Fei Long, Yuhan Liu, Jinsong Su
Title: TTCS: Test-Time Curriculum Synthesis for Self-Evolving
Abstract:
Test‑Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high‑quality pseudo‑labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co‑evolving test‑time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver's current capability, while the solver updates itself using self‑consistency rewards computed from multiple sampled responses on both original test and synthetic questions. Crucially, the solver's feedback guides the synthesizer to generate questions aligned with the model's current capability, and the generated question variants in turn stabilize the solver's test‑time training. Experiments show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks and transfers to general‑domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test‑time curricula for self‑evolving. Our code and implementation details are available at https://github.com/XMUDeepLIT/TTCS.

Authors:Youngeun Kim
Title: MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning
Abstract:
Group‑relative policy optimization methods train language models by generating multiple rollouts per prompt and normalizing rewards with a shared mean reward baseline. In resource‑constrained settings where the rollout budget is small, accuracy often degrades. We find that noise in the shared baseline induces advantage sign flips, where some rollouts receive an incorrect advantage sign, and the update direction is reversed. To address this, we propose Median‑Centered Group Relative Policy Optimization (MC‑GRPO), a simple and effective solution for small‑rollout training. Our main idea is to replace the mean baseline with a median baseline: the median is far less sensitive to outlier rewards than the mean, mitigating the sign flips under small rollout size (G). We generate one additional rollout for median reference (G+1), and compute advantages by using the group median. With an odd‑sized group, exactly one completion is the median and receives zero advantage, we exclude this pivot rollout from backpropagation so the number of gradient‑contributing samples per prompt remains G, preserving the core update cost of standard G‑rollout training. Across various GRPO‑family methods and a wide range of models and scales, this median‑centered training consistently improves stability and final accuracy in the low‑rollout regime, reducing the gap between G=2 and G=8 to within 1%. Code is available at https://github.com/lotusroot‑kim/MC‑GRPO

Authors:Naeem Paeedeh, Mahardhika Pratama, Ary Shiddiqi, Zehong Cao, Mukesh Prasad, Wisnu Jatmiko
Title: Cross-Domain Few-Shot Learning for Hyperspectral Image Classification Based on Mixup Foundation Model
Abstract:
Although cross‑domain few‑shot learning (CDFSL) for hyper‑spectral image (HSI) classification has attracted significant research interest, existing works often rely on an unrealistic data augmentation procedure in the form of external noise to enlarge the sample size, thus greatly simplifying the issue of data scarcity. They involve a large number of parameters for model updates, being prone to the overfitting problem. To the best of our knowledge, none has explored the strength of the foundation model, having strong generalization power to be quickly adapted to downstream tasks. This paper proposes the MIxup FOundation MOdel (MIFOMO) for CDFSL of HSI classifications. MIFOMO is built upon the concept of a remote sensing (RS) foundation model, pre‑trained across a large scale of RS problems, thus featuring generalizable features. The notion of coalescent projection (CP) is introduced to quickly adapt the foundation model to downstream tasks while freezing the backbone network. The concept of mixup domain adaptation (MDM) is proposed to address the extreme domain discrepancy problem. Last but not least, the label smoothing concept is implemented to cope with noisy pseudo‑label problems. Our rigorous experiments demonstrate the advantage of MIFOMO, where it beats prior arts with up to 14% margin. The source code of MIFOMO is open‑sourced in https://github.com/Naeem‑ Paeedeh/MIFOMO for reproducibility and convenient further study.

Authors:Aditya Sarkar, Yi Li, Jiacheng Cheng, Shlok Mishra, Nuno Vasconcelos
Title: Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction
Abstract:
Selective prediction aims to endow predictors with a reject option, to avoid low confidence predictions. However, existing literature has primarily focused on closed‑set tasks, such as visual question answering with predefined options or fixed‑category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training‑free approaches of low‑complexity, applicable to any foundation model and consider methods based on external vision‑language model embeddings, like CLIP. This is denoted as Plug‑and‑Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual‑language representations, leading to high variance in image‑text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory augmented PaPSP (MA‑PaPSP) model, which augments PaPSP with a retrieval dataset of image‑text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest‑neighbor pairs and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA‑PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image‑text matching, and fine‑grained classification. Code is publicly available at https://github.com/kingston‑aditya/MA‑PaPSP.

Authors:Zhengyan Wan, Yidong Ouyang, Liyan Xie, Fang Fang, Hongyuan Zha, Guang Cheng
Title: Corrected Samplers for Discrete Flow Models
Abstract:
Discrete flow models (DFMs) have been proposed to learn the data distribution on a finite state space, offering a flexible framework as an alternative to discrete diffusion models. A line of recent work has studied samplers for discrete diffusion models, such as tau‑leaping and Euler solver. However, these samplers require a large number of iterations to control discretization error, since the transition rates are frozen in time and evaluated at the initial state within each time interval. Moreover, theoretical results for these samplers often require boundedness conditions of the transition rate or they focus on a specific type of source distributions. To address those limitations, we establish non‑asymptotic discretization error bounds for those samplers without any restriction on transition rates and source distributions, under the framework of discrete flow models. Furthermore, by analyzing a one‑step lower bound of the Euler sampler, we propose two corrected samplers: time‑corrected sampler and location‑corrected sampler, which can reduce the discretization error of tau‑leaping and Euler solver with almost no additional computational cost. We rigorously show that the location‑corrected sampler has a lower iteration complexity than existing parallel samplers. We validate the effectiveness of the proposed method by demonstrating improved generation quality and reduced inference time on both simulation and text‑to‑image generation tasks. Code can be found in https://github.com/WanZhengyan/Corrected‑Samplers‑for‑Discrete‑Flow‑Models.

Authors:Weiqi Wang, Xin Liu, Binxuan Huang, Hejie Cui, Rongzhi Zhang, Changlong Yu, Shuowei Jin, Jingfeng Yang, Qingyu Yin, Zhengyang Wang, Zheng Li, Yifan Gao, Priyanka Nigam, Bing Yin, Lihong Li, Yangqiu Song
Title: HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning
Abstract:
RLVR is now a standard way to train LLMs on reasoning tasks with verifiable outcomes, but when rollout generation dominates the cost, efficiency depends heavily on which prompts you sample and when. In practice, prompt pools are often static or only loosely tied to the model's learning progress, so uniform sampling can't keep up with the shifting capability frontier and ends up wasting rollouts on prompts that are already solved or still out of reach. Existing approaches improve efficiency through filtering, curricula, adaptive rollout allocation, or teacher guidance, but they typically assume a fixed pool‑which makes it hard to support stable on‑policy pool growth‑or they add extra teacher cost and latency. We introduce HeaPA (Heap Sampling and On‑Policy Query Augmentation), which maintains a bounded, evolving pool, tracks the frontier using heap‑based boundary sampling, expands the pool via on‑policy augmentation with lightweight asynchronous validation, and stabilizes correlated queries through topology‑aware re‑estimation of pool statistics and controlled reinsertion. Across two training corpora, two training recipes, and seven benchmarks, HeaPA consistently improves accuracy and reaches target performance with fewer computations while keeping wall‑clock time comparable. Our analyses suggest these gains come from frontier‑focused sampling and on‑policy pool growth, with the benefits becoming larger as model scale increases. Our code is available at https://github.com/horizon‑rl/HeaPA.

Authors:Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
Title: SCALAR: Quantifying Structural Hallucination, Consistency, and Reasoning Gaps in Materials Foundation Models
Abstract:
Large language models are increasingly applied to materials science reasoning, yet their behavior under physically structured distribution shifts remains poorly understood. We introduce SCALAR (Structural Consistency And Logic Across Regimes), a benchmark for evaluating geometric scale generalization and its connection to structural hallucination, consistency, and reasoning in materials foundation models. Given canonical crystal representations, models must reason about derived nanoparticle structures obtained through supercell expansion and geometric truncation across length scales spanning a few atoms to over 18,000 atoms, totaling \approx100,000 structures from DFT‑validated unit cells. SCALAR defines three tasks. (i) CIF to property prediction. (ii) A Chain‑of‑Thought variant with explicit physics‑grounded reasoning. (iii) Inverse retrieval identifying crystals from candidates given target properties. Outputs are evaluated via structured metrics capturing numeric error, hallucination, cross‑prompt consistency, monotonic reasoning, output validity, and retrieval regret. Experiments across diverse foundation models reveal large, model‑dependent shifts under explicit reasoning, often reducing hallucination and error, but frequently destabilizing consistency or validity. These results demonstrate that geometric scale generalization cannot be inferred from accuracy alone. Supplementary materials are available at https://github.com/KurbanIntelligenceLab/SCALAR.

Authors:Bo Yuan, Yun Zhou, Zhichao Xu, Kiran Ramnath, Aosong Feng, Balasubramaniam Srinivasan
Title: BayesFlow: A Probability Inference Framework for Meta-Agent Assisted Workflow Generation
Abstract:
Automatic workflow generation is the process of automatically synthesizing sequences of LLM calls, tool invocations, and post‑processing steps for complex end‑to‑end tasks. Most prior methods cast this task as an optimization problem with limited theoretical grounding. We propose to cast workflow generation as Bayesian inference over a posterior distribution on workflows, and introduce Bayesian Workflow Generation (BWG), a sampling framework that builds workflows step‑by‑step using parallel look‑ahead rollouts for importance weighting and a sequential in‑loop refiner for pool‑wide improvements. We prove that, without the refiner, the weighted empirical distribution converges to the target posterior. We instantiate BWG as BayesFlow, a training‑free algorithm for workflow construction. Across six benchmark datasets, BayesFlow improves accuracy by up to 9 percentage points over SOTA workflow generation baselines and by up to 65 percentage points over zero‑shot prompting, establishing BWG as a principled upgrade to search‑based workflow design. Code will be available on https://github.com/BoYuanVisionary/BayesFlow.

Authors:Manuela Chacon-Chamorro, Luis Felipe Giraldo, Nicanor Quijano
Title: Learning Reward Functions for Cooperative Resilience in Multi-Agent Systems
Abstract:
Multi‑agent systems often operate in dynamic and uncertain environments, where agents must not only pursue individual goals but also safeguard collective functionality. This challenge is especially acute in mixed‑motive multi‑agent systems. This work focuses on cooperative resilience, the ability of agents to anticipate, resist, recover, and transform in the face of disruptions, a critical yet underexplored property in Multi‑Agent Reinforcement Learning. We study how reward function design influences resilience in mixed‑motive settings and introduce a novel framework that learns reward functions from ranked trajectories, guided by a cooperative resilience metric. Agents are trained in a suite of social dilemma environments using three reward strategies: i) traditional individual reward; ii) resilience‑inferred reward; and iii) hybrid that balance both. We explore three reward parameterizations‑linear models, hand‑crafted features, and neural networks, and employ two preference‑based learning algorithms to infer rewards from behavioral rankings. Our results demonstrate that hybrid strategy significantly improve robustness under disruptions without degrading task performance and reduce catastrophic outcomes like resource overuse. These findings underscore the importance of reward design in fostering resilient cooperation, and represent a step toward developing robust multi‑agent systems capable of sustaining cooperation in uncertain environments.

Authors:Shirin Reyhanian, Laurenz Wiskott
Title: Is Hierarchical Quantization Essential for Optimal Reconstruction?
Abstract:
Vector‑quantized variational autoencoders (VQ‑VAEs) are central to models that rely on high reconstruction fidelity, from neural compression to generative pipelines. Hierarchical extensions, such as VQ‑VAE2, are often credited with superior reconstruction performance because they split global and local features across multiple levels. However, since higher levels derive all their information from lower levels, they should not carry additional reconstructive content beyond what the lower‑level already encodes. Combined with recent advances in training objectives and quantization mechanisms, this leads us to ask whether a single‑level VQ‑VAE, with matched representational budget and no codebook collapse, can equal the reconstruction fidelity of its hierarchical counterpart. Although the multi‑scale structure of hierarchical models may improve perceptual quality in downstream tasks, the effect of hierarchy on reconstruction accuracy, isolated from codebook utilization and overall representational capacity, remains empirically underexamined. We revisit this question by comparing a two‑level VQ‑VAE and a capacity‑matched single‑level model on high‑resolution ImageNet images. Consistent with prior observations, we confirm that inadequate codebook utilization limits single‑level VQ‑VAEs and that overly high‑dimensional embeddings destabilize quantization and increase codebook collapse. We show that lightweight interventions such as initialization from data, periodic reset of inactive codebook vectors, and systematic tuning of codebook hyperparameters significantly reduce collapse. Our results demonstrate that when representational budgets are matched, and codebook collapse is mitigated, single‑level VQ‑VAEs can match the reconstruction fidelity of hierarchical variants, challenging the assumption that hierarchical quantization is inherently superior for high‑quality reconstructions.

Authors:Ido Aharon, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus
Title: Tacit Coordination of Large Language Models
Abstract:
In tacit coordination games with multiple outcomes, purely rational solution concepts, such as Nash equilibria, provide no guidance for which equilibrium to choose. Shelling's theory explains how, in these settings, humans coordinate by relying on focal points: solutions or outcomes that naturally arise because they stand out in some way as salient or prominent to all players. This work studies Large Language Models (LLMs) as players in tacit coordination games, and addresses how, when, and why focal points emerge. We compare and quantify the coordination capabilities of LLMs in cooperative and competitive games for which human experiments are available. We also introduce several learning‑free strategies to improve the coordination of LLMs, with themselves and with humans. On a selection of heterogeneous open‑source models, including Llama, Qwen, and GPT‑oss, we discover that LLMs have a remarkable capability to coordinate and often outperform humans, yet fail on common‑sense coordination that involves numbers or nuanced cultural archetypes. This paper constitutes the first large‑scale assessment of LLMs' tacit coordination within the theoretical and psychological framework of focal points.

Authors:Daniel Stein, Shaoyi Huang, Rolf Drechsler, Bing Li, Grace Li Zhang
Title: Late Breaking Results: Conversion of Neural Networks into Logic Flows for Edge Computing
Abstract:
Neural networks have been successfully applied in various resource‑constrained edge devices, where usually central processing units (CPUs) instead of graphics processing units exist due to limited power availability. State‑of‑the‑art research still focuses on efficiently executing enormous numbers of multiply‑accumulate (MAC) operations. However, CPUs themselves are not good at executing such mathematical operations on a large scale, since they are more suited to execute control flow logic, i.e., computer algorithms. To enhance the computation efficiency of neural networks on CPUs, in this paper, we propose to convert them into logic flows for execution. Specifically, neural networks are first converted into equivalent decision trees, from which decision paths with constant leaves are then selected and compressed into logic flows. Such logic flows consist of if and else structures and a reduced number of MAC operations. Experimental results demonstrate that the latency can be reduced by up to 14.9 % on a simulated RISC‑V CPU without any accuracy degradation. The code is open source at https://github.com/TUDa‑HWAI/NN2Logic

Authors:Zhengyan Huan, Camila Pazos, Martin Klassen, Vincent Croft, Pierre-Hugues Beauchemin, Shuchin Aeron
Title: The Ensemble Inverse Problem: Applications and Methods
Abstract:
We introduce a new multivariate statistical problem that we refer to as the Ensemble Inverse Problem (EIP). The aim of EIP is to invert for an ensemble that is distributed according to the pushforward of a prior under a forward process. In high energy physics (HEP), this is related to a widely known problem called unfolding, which aims to reconstruct the true physics distribution of quantities, such as momentum and angle, from measurements that are distorted by detector effects. In recent applications, the EIP also arises in full waveform inversion (FWI) and inverse imaging with unknown priors. We propose non‑iterative inference‑time methods that construct posterior samplers based on a new class of conditional generative models, which we call ensemble inverse generative models. For the posterior modeling, these models additionally use the ensemble information contained in the observation set on top of single measurements. Unlike existing methods, our proposed methods avoid explicit and iterative use of the forward model at inference time via training across several sets of truth‑observation pairs that are consistent with the same forward model, but originate from a wide range of priors. We demonstrate that this training procedure implicitly encodes the likelihood model. The use of ensemble information helps posterior inference and enables generalization to unseen priors. We benchmark the proposed method on several synthetic and real datasets in inverse imaging, HEP, and FWI. The codes are available at https://github.com/ZhengyanHuan/The‑Ensemble‑Inverse‑Problem‑‑Applications‑and‑Methods.

Authors:Meng Cao, Jiexi Liu, Songcan Chen
Title: Negatives-Dominant Contrastive Learning for Generalization in Imbalanced Domains
Abstract:
Imbalanced Domain Generalization (IDG) focuses on mitigating both domain and label shifts, both of which fundamentally shape the model's decision boundaries, particularly under heterogeneous long‑tailed distributions across domains. Despite its practical significance, it remains underexplored, primarily due to the technical complexity of handling their entanglement and the paucity of theoretical foundations. In this paper, we begin by theoretically establishing the generalization bound for IDG, highlighting the role of posterior discrepancy and decision margin. This bound motivates us to focus on directly steering decision boundaries, marking a clear departure from existing methods. Subsequently, we technically propose a novel Negative‑Dominant Contrastive Learning (NDCL) for IDG to enhance discriminability while enforce posterior consistency across domains. Specifically, inter‑class decision‑boundary separation is enhanced by placing greater emphasis on negatives as the primary signal in our contrastive learning, naturally amplifying gradient signals for minority classes to avoid the decision boundary being biased toward majority classes. Meanwhile, intra‑class compactness is encouraged through a re‑weighted cross‑entropy strategy, and posterior consistency across domains is enforced through a prediction‑central alignment strategy. Finally, rigorous yet challenging experiments on benchmarks validate the effectiveness of our NDCL. The code is available at https://github.com/Alrash/NDCL.

Authors:Jian Gao, Yiwei Zou, Abhishek Pradhan, Wenhao Huang, Yumin Su, Kaiyuan Yang, Xuan Zhang
Title: PowerGenie: Analytically-Guided Evolutionary Discovery of Superior Reconfigurable Power Converters
Abstract:
Discovering superior circuit topologies requires navigating an exponentially large design space‑a challenge traditionally reserved for human experts. Existing AI methods either select from predefined templates or generate novel topologies at a limited scale without rigorous verification, leaving large‑scale performance‑driven discovery underexplored. We present PowerGenie, a framework for automated discovery of higher‑performance reconfigurable power converters at scale. PowerGenie introduces: (1) an automated analytical framework that determines converter functionality and theoretical performance limits without component sizing or SPICE simulation, and (2) an evolutionary finetuning method that co‑evolves a generative model with its training distribution through fitness selection and uniqueness verification. Unlike existing methods that suffer from mode collapse and overfitting, our approach achieves higher syntax validity, function validity, novelty rate, and figure‑of‑merit (FoM). PowerGenie discovers a novel 8‑mode reconfigurable converter with 23% higher FoM than the best training topology. SPICE simulations confirm average absolute efficiency gains of 10% across 8 modes and up to 17% at a single mode. Code is available at https://github.com/xz‑group/PowerGenie.

Authors:Qianwei Yang, Dong Xu, Zhangfan Yang, Sisi Yuan, Zexuan Zhu, Jianqiang Li, Junkai Ji
Title: From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation
Abstract:
Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT‑based molecular language models (MLM) have shown strong molecular design performance by learning chemical syntax and semantics from large‑scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph‑structured nature of molecules when formulated as next‑token prediction problems, and they typically lack explicit mechanisms for target‑aware generation. Here, we propose SoftMol, a unified framework that co‑designs molecular representation, model architecture, and search strategy for target‑aware molecular generation. SoftMol introduces soft fragments, a rule‑free block representation of SMILES that enables diffusion‑native modeling, and develops SoftBD, the first block‑diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug‑likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC‑Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target‑aware manner. Experimental results show that, compared with current state‑of‑the‑art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2‑3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu‑aicourse/softmol

Authors:Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo
Title: DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
Abstract:
Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention‑3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non‑deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient‑reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High‑Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q‑Tile Iteration, a reversed query‑block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28× compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open‑sourced at https://github.com/SJTU‑Liquid/deterministic‑FA3.

Authors:Alexandre Myara, Nicolas Bourriez, Thomas Boyer, Thomas Lemercier, Ihab Bendidi, Auguste Genovesio
Title: XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision
Abstract:
Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce \textscXFactors, a weakly‑supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into a residual subspace \mathcalS and factor‑specific subspaces \mathcalT_1,\ldots,\mathcalT_K and a residual subspace \mathcalS. Each target factor is encoded in its assigned \mathcalT_i through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both \mathcalS and the aggregated factor subspaces, organizing the geometry without additional supervision for non‑targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with constant hyperparameters, \textscXFactors achieves state‑of‑the‑art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales correctly with increasing latent capacity and evaluate it on the real‑world dataset CelebA. Our code is available at \hrefhttps://github.com/ICML26‑anon/XFactorsgithub.com/ICML26‑anon/XFactors.

Authors:Qisong Xiao, Xinhai Chen, Qinglin Wang, Xiaowei Guo, Binglin Wang, Weifeng Chen, Zhichao Wang, Yunfei Liu, Rui Xia, Hang Zou, Gencheng Liu, Shuai Li, Jie Liu
Title: LLM4Fluid: Large Language Models as Generalizable Neural Solvers for Fluid Dynamics
Abstract:
Deep learning has emerged as a promising paradigm for spatio‑temporal modeling of fluid dynamics. However, existing approaches often suffer from limited generalization to unseen flow conditions and typically require retraining when applied to new scenarios. In this paper, we present LLM4Fluid, a spatio‑temporal prediction framework that leverages Large Language Models (LLMs) as generalizable neural solvers for fluid dynamics. The framework first compresses high‑dimensional flow fields into a compact latent space via reduced‑order modeling enhanced with a physics‑informed disentanglement mechanism, effectively mitigating spatial feature entanglement while preserving essential flow structures. A pretrained LLM then serves as a temporal processor, autoregressively predicting the dynamics of physical sequences with time series prompts. To bridge the modality gap between prompts and physical sequences, which can otherwise degrade prediction accuracy, we propose a dedicated modality alignment strategy that resolves representational mismatch and stabilizes long‑term prediction. Extensive experiments across diverse flow scenarios demonstrate that LLM4Fluid functions as a robust and generalizable neural solver without retraining, achieving state‑of‑the‑art accuracy while exhibiting powerful zero‑shot and in‑context learning capabilities. Code and datasets are publicly available at https://github.com/qisongxiao/LLM4Fluid.

Authors:Tianqi Zhao, Guanyang Wang, Yan Shuo Tan, Qiong Zhang
Title: TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering
Abstract:
Clustering tabular data is a fundamental yet challenging problem due to heterogeneous feature types, diverse data‑generating mechanisms, and the absence of transferable inductive biases across datasets. Prior‑fitted networks (PFNs) have recently demonstrated strong generalization in supervised tabular learning by amortizing Bayesian inference under a broad synthetic prior. Extending this paradigm to clustering is nontrivial: clustering is unsupervised, admits a combinatorial and permutation‑invariant output space, and requires inferring the number of clusters. We introduce TabClustPFN, a prior‑fitted network for tabular data clustering that performs amortized Bayesian inference over both cluster assignments and cluster cardinality. Pretrained on synthetic datasets drawn from a flexible clustering prior, TabClustPFN clusters unseen datasets in a single forward pass, without dataset‑specific retraining or hyperparameter tuning. The model naturally handles heterogeneous numerical and categorical features and adapts to a wide range of clustering structures. Experiments on synthetic data and curated real‑world tabular benchmarks show that TabClustPFN outperforms classical, deep, and amortized clustering baselines, while exhibiting strong robustness in out‑of‑the‑box exploratory settings. Code is available at https://github.com/Tianqi‑Zhao/TabClustPFN.

Authors:Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Danilo Mandic
Title: KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices
Abstract:
The success of Hyper‑Connections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold‑Constrained Hyper‑Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn‑Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive \mathcalO(n^3C) parameter complexity with n as the width of the residual stream and C as the feature dimension. The recently proposed mHC‑lite reparametrizes the residual matrix via the Birkhoff‑von‑Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, \mathcalO \left( nC \cdot n! \right). To address both challenges, we propose KromHC, which uses the \underlineKronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in \underlinemHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to \mathcalO(n^2C). Comprehensive experiments demonstrate that KromHC matches or even outperforms state‑of‑the‑art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is available at \texttthttps://github.com/wz1119/KromHC.

Authors:Lige Zhang, Ali Maatouk, Jialin Chen, Leandros Tassiulas, Rex Ying
Title: Multi-Modal Time Series Prediction via Mixture of Modulated Experts
Abstract:
Real‑world time series exhibit complex and evolving dynamics, making accurate forecasting extremely challenging. Recent multi‑modal forecasting methods leverage textual information such as news reports to improve prediction, but most rely on token‑level fusion that mixes temporal patches with language tokens in a shared embedding space. However, such fusion can be ill‑suited when high‑quality time‑text pairs are scarce and when time series exhibit substantial variation in scale and characteristics, thus complicating cross‑modal alignment. In parallel, Mixture‑of‑Experts (MoE) architectures have proven effective for both time series modeling and multi‑modal learning, yet many existing MoE‑based modality integration methods still depend on token‑level fusion. To address this, we propose Expert Modulation, a new paradigm for multi‑modal time series prediction that conditions both routing and expert computation on textual signals, enabling direct and efficient cross‑modal control over expert behavior. Through comprehensive theoretical analysis and experiments, our proposed method demonstrates substantial improvements in multi‑modal time series prediction. The current code is available at https://github.com/BruceZhangReve/MoME

Authors:Robert van der Klis, Ricardo Chávez Torres, Max van Spengler, Yuhui Ding, Thomas Hofmann, Pascal Mettes
Title: Fast and Geometrically Grounded Lorentz Neural Networks
Abstract:
Hyperbolic space is quickly gaining traction as a promising geometry for hierarchical and robust representation learning. A core open challenge is the development of a mathematical formulation of hyperbolic neural networks that is both efficient and captures the key properties of hyperbolic space. The Lorentz model of hyperbolic space has been shown to enable both fast forward and backward propagation. However, we prove that, with the current formulation of Lorentz linear layers, the hyperbolic norms of the outputs scale logarithmically with the number of gradient descent steps, nullifying the key advantage of hyperbolic geometry. We propose a new Lorentz linear layer grounded in the well‑known ``distance‑to‑hyperplane" formulation. We prove that our formulation results in the usual linear scaling of output hyperbolic norms with respect to the number of gradient descent steps. Our new formulation, together with further algorithmic efficiencies through Lorentzian activation functions and a new caching strategy results in neural networks fully abiding by hyperbolic geometry while simultaneously bridging the computation gap to Euclidean neural networks. Code available at: https://github.com/robertdvdk/hyperbolic‑fully‑connected.

Authors:Minjae Cho, Huy Trong Tran
Title: Intrinsic Reward Policy Optimization for Sparse-Reward Environments
Abstract:
Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic rewards can also provide principled guidance for exploration by, for example, combining them with extrinsic rewards to optimize a policy or using them to train subpolicies for hierarchical learning. However, the former approach suffers from unstable credit assignment, while the latter exhibits sample inefficiency and sub‑optimality. We propose a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for an extrinsic reward without pretraining subpolicies. Our algorithm ‑‑ intrinsic reward policy optimization (IRPO) ‑‑ achieves this by using a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse‑reward environments. We demonstrate that IRPO improves performance and sample efficiency relative to baselines in discrete and continuous environments, and formally analyze the optimization problem solved by IRPO. Our code is available at https://github.com/Mgineer117/IRPO.

Authors:Aoyu Pang, Maonan Wang, Zifan Sha, Wenwei Yue, Changle Li, Chung Shue Chen, Man-On Pun
Title: Heterogeneous Vertiport Selection Optimization for On-Demand Air Taxi Services: A Deep Reinforcement Learning Approach
Abstract:
Urban Air Mobility (UAM) has emerged as a transformative solution to alleviate urban congestion by utilizing low‑altitude airspace, thereby reducing pressure on ground transportation networks. To enable truly efficient and seamless door‑to‑door travel experiences, UAM requires close integration with existing ground transportation infrastructure. However, current research on optimal integrated routing strategies for passengers in air‑ground mobility systems remains limited, with a lack of systematic exploration.To address this gap, we first propose a unified optimization model that integrates strategy selection for both air and ground transportation. This model captures the dynamic characteristics of multimodal transport networks and incorporates real‑time traffic conditions alongside passenger decision‑making behavior. Building on this model, we propose a Unified Air‑Ground Mobility Coordination (UAGMC) framework, which leverages deep reinforcement learning (RL) and Vehicle‑to‑Everything (V2X) communication to optimize vertiport selection and dynamically plan air taxi routes. Experimental results demonstrate that UAGMC achieves a 34% reduction in average travel time compared to conventional proportional allocation methods, enhancing overall travel efficiency and providing novel insights into the integration and optimization of multimodal transportation systems. This work lays a solid foundation for advancing intelligent urban mobility solutions through the coordination of air and ground transportation modes. The related code can be found at https://github.com/Traffic‑Alpha/UAGMC.

Authors:Seonghyeon Go, Yumin Kim
Title: Music Plagiarism Detection: Problem Formulation and a Segment-based Solution
Abstract:
Recently, the problem of music plagiarism has emerged as an even more pressing social issue. As music information retrieval research advances, there is a growing effort to address issues related to music plagiarism. However, many studies, including our previous work, have conducted research without clearly defining what the music plagiarism detection task actually involves. This lack of a clear definition has slowed research progress and made it hard to apply results to real‑world scenarios. To fix this situation, we defined how Music Plagiarism Detection is different from other MIR tasks and explained what problems need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task. In addition, we propose a method based on segment transcription as one way to solve the task. Our demo and dataset are available at https://github.com/Mippia/ICASSP2026‑MPD.

Authors:Minjae Kwon, Josephine Lamp, Lu Feng
Title: Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed
Abstract:
Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training‑time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety‑critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test‑time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time‑in‑Range gains of 13‑‑14% for strong baselines such as PPO‑Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety‑critical control domains. Code is available at https://github.com/safe‑autonomy‑lab/GlucoSim and https://github.com/safe‑autonomy‑lab/GlucoAlg.

Authors:Marvin Sextro, Weronika Kłos, Gabriel Dernbach
Title: MapPFN: Learning Causal Perturbation Maps in Context
Abstract:
Planning effective interventions in biological systems requires treatment‑effect models that adapt to unseen biological contexts by identifying their specific underlying mechanisms. Yet single‑cell perturbation datasets span only a handful of biological contexts, and existing methods cannot leverage new interventional evidence at inference time to adapt beyond their training data. To meta‑learn a perturbation effect estimator, we present MapPFN, a prior‑data fitted network (PFN) pretrained on synthetic data generated from a prior over causal perturbations. Given a set of experiments, MapPFN uses in‑context learning to predict post‑perturbation distributions, without gradient‑based optimization. Despite being pretrained on in silico gene knockouts alone, MapPFN identifies differentially expressed genes, matching the performance of models trained on real single‑cell data. Our code and data are available at https://github.com/marvinsxtr/MapPFN.

Authors:Zongheng Guo, Tao Chen, Yang Jiao, Yi Pan, Xiao Hu, Manuela Ferrario
Title: SIGMA-PPG: Statistical-prior Informed Generative Masking Architecture for PPG Foundation Model
Abstract:
Current foundation model for photoplethysmography (PPG) signals is challenged by the intrinsic redundancy and noise of the signal. Standard masked modeling often yields trivial solutions while contrastive methods lack morphological precision. To address these limitations, we propose a Statistical‑prior Informed Generative Masking Architecture (SIGMA‑PPG), a generative foundation model featuring a Prior‑Guided Adversarial Masking mechanism, where a reinforcement learning‑driven teacher leverages statistical priors to create challenging learning paths that prevent overfitting to noise. We also incorporate a semantic consistency constraint via vector quantization to ensure that physiologically identical waveforms (even those altered by recording artifacts or minor perturbations) map to shared indices. This enhances codebook semantic density and eliminates redundant feature structures. Pre‑trained on over 120,000 hours of data, SIGMA‑PPG achieves superior average performance compared to five state‑of‑the‑art baselines across 12 diverse downstream tasks. The code is available at https://github.com/ZonghengGuo/SigmaPPG.

Authors:Hajung Kim, Eunha Lee, Sohyun Chung, Jueon Park, Seungheun Baek, Jaewoo Kang
Title: ATTNSOM: Learning Cross-Isoform Attention for Cytochrome P450 Site-of-Metabolism
Abstract:
Identifying metabolic sites where cytochrome P450 enzymes metabolize small‑molecule drugs is essential for drug discovery. Although existing computational approaches have been proposed for site‑of‑metabolism prediction, they typically ignore cytochrome P450 isoform identity or model isoforms independently, thereby failing to fully capture inherent cross‑isoform metabolic patterns. In addition, prior evaluations often rely on top‑k metrics, where false positive atoms may be included among the top predictions, underscoring the need for complementary metrics that more directly assess binary atom‑level discrimination under severe class imbalance. We propose ATTNSOM, an atom‑level site‑of‑metabolism prediction framework that integrates intrinsic molecular reactivity with cross‑isoform relationships. The model combines a shared graph encoder, molecule‑conditioned atom representations, and a cross‑attention mechanism to capture correlated metabolic patterns across cytochrome P450 isoforms. The model is evaluated on two benchmark datasets annotated with site‑of‑metabolism labels at atom resolution. Across these benchmarks, the model achieves consistently strong top‑k performance across multiple cytochrome P450 isoforms. Relative to ablated variants, the model yields higher Matthews correlation coefficient, indicating improved discrimination of true metabolic sites. These results support the importance of explicitly modeling cross‑isoform relationships for site‑of‑metabolism prediction. The code and datasets are available at https://github.com/dmis‑lab/ATTNSOM.

Authors:Bharath Krishnamurthy, Ajita Rattani
Title: VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings
Abstract:
Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non‑scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound‑morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero‑shot framework that produces high‑fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine‑grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state‑of‑the‑art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: https://vcbsl.github.io/VoxMorph/

Authors:Matteo Gianferrari, Omayma Moussadek, Riccardo Salami, Cosimo Fiorini, Lorenzo Tartarini, Daniela Gandolfi, Simone Calderara
Title: STAER: Temporal Aligned Rehearsal for Continual Spiking Neural Network
Abstract:
Spiking Neural Networks (SNNs) are inherently suited for continuous learning due to their event‑driven temporal dynamics; however, their application to Class‑Incremental Learning (CIL) has been hindered by catastrophic forgetting and the temporal misalignment of spike patterns. In this work, we introduce Spiking Temporal Alignment with Experience Replay (STAER), a novel framework that explicitly preserves temporal structure to bridge the performance gap between SNNs and ANNs. Our approach integrates a differentiable Soft‑DTW alignment loss to maintain spike timing fidelity and employs a temporal expansion and contraction mechanism on output logits to enforce robust representation learning. Implemented on a deep ResNet19 spiking backbone, STAER achieves state‑of‑the‑art performance on Sequential‑MNIST and Sequential‑CIFAR10. Empirical results demonstrate that our method matches or outperforms strong ANN baselines (ER, DER++) while preserving biologically plausible dynamics. Ablation studies further confirm that explicit temporal alignment is critical for representational stability, positioning STAER as a scalable solution for spike‑native lifelong learning. Code is available at https://github.com/matteogianferrari/staer.

Authors:Hao Sun, Da-Wei Zhou
Title: C3Box: A CLIP-based Class-Incremental Learning Toolbox
Abstract:
Traditional machine learning systems are typically designed for static data distributions, which suffer from catastrophic forgetting when learning from evolving data streams. Class‑Incremental Learning (CIL) addresses this challenge by enabling learning systems to continuously learn new classes while preserving prior knowledge. With the rise of pre‑trained models (PTMs) such as CLIP, leveraging their strong generalization and semantic alignment capabilities has become a promising direction in CIL. However, existing CLIP‑based CIL methods are often scattered across disparate codebases, rely on inconsistent configurations, hindering fair comparisons, reproducibility, and practical adoption. Therefore, we propose C3Box (CLIP‑based Class‑inCremental learning toolBOX), a modular and comprehensive Python toolbox. C3Box integrates representative traditional CIL methods, ViT‑based CIL methods, and state‑of‑the‑art CLIP‑based CIL methods into a unified CLIP‑based framework. By inheriting the streamlined design of PyCIL, C3Box provides a JSON‑based configuration and standardized execution pipeline. This design enables reproducible experimentation with low engineering overhead and makes C3Box a reliable benchmark platform for continual learning research. Designed to be user‑friendly, C3Box relies only on widely used open‑source libraries and supports major operating systems. The code is available at https://github.com/LAMDA‑CL/C3Box.

Authors:Weixin Chen, Li Chen, Yuhan Zhao
Title: Post-Training Fairness Control: A Single-Train Framework for Dynamic Fairness in Recommendation
Abstract:
Despite growing efforts to mitigate unfairness in recommender systems, existing fairness‑aware methods typically fix the fairness requirement at training time and provide limited post‑training flexibility. However, in real‑world scenarios, diverse stakeholders may demand differing fairness requirements over time, so retraining for different fairness requirements becomes prohibitive. To address this limitation, we propose Cofair, a single‑train framework that enables post‑training fairness control in recommendation. Specifically, Cofair introduces a shared representation layer with fairness‑conditioned adapter modules to produce user embeddings specialized for varied fairness levels, along with a user‑level regularization term that guarantees user‑wise monotonic fairness improvements across these levels. We theoretically establish that the adversarial objective of Cofair upper bounds demographic parity and the regularization term enforces progressive fairness at user level. Comprehensive experiments on multiple datasets and backbone models demonstrate that our framework provides dynamic fairness at different levels, delivering comparable or better fairness‑accuracy curves than state‑of‑the‑art baselines, without the need to retrain for each new fairness requirement. Our code is publicly available at https://github.com/weixinchen98/Cofair.

Authors:Shahd Seddik, Fahd Seddik, Iman Saberi, Fatemeh Fard, Minh Hieu Huynh, Patanamon Thongtanunam
Title: Context-Augmented Code Generation Using Programming Knowledge Graphs
Abstract:
Large Language Models (LLMs) excel at code generation but struggle with complex problems. Retrieval‑Augmented Generation (RAG) mitigates this issue by integrating external knowledge, yet retrieval models often miss relevant context, and generation models hallucinate with irrelevant data. We propose Programming Knowledge Graph (PKG) for semantic representation and fine‑grained retrieval of code and text. Our approach enhances retrieval precision through tree pruning and mitigates hallucinations via a re‑ranking mechanism that integrates non‑RAG solutions. Structuring external data into finer‑grained nodes improves retrieval granularity. Evaluations on HumanEval and MBPP show up to 20% pass@1 accuracy gains and a 34% improvement over baselines on MBPP. Our findings demonstrate that our proposed PKG approach along with re‑ranker effectively address complex problems while maintaining minimal negative impact on solutions that are already correct without RAG. The replication package is published at https://github.com/iamshahd/ProgrammingKnowledgeGraph

Authors:Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang
Title: HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs
Abstract:
As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low‑bit quantization. However, most quantization‑aware training (QAT) methods apply hard rounding and the straight‑through estimator (STE) from the beginning of the training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian‑guided differentiable QAT framework for extremely low‑bit LLMs, which replaces the rigid step function with a temperature‑controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor‑wise Hessian trace metric as a lightweight curvature signal to drive fine‑grained temperature annealing, enabling sensitivity‑aware discretization across the model. Evaluations on Llama‑3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero‑shot improvements of 5.39% and 4.34% for the 1B and 3B models. These results indicate that Hessian‑guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58‑bit LLMs. The code is available at https://github.com/hestia2026/Hestia.

Authors:Ariel Maymon, Yanir Buznah, Uri Shaham
Title: Unsupervised Ensemble Learning Through Deep Energy-based Models
Abstract:
Unsupervised ensemble learning emerged to address the challenge of combining multiple learners' predictions without access to ground truth labels or additional data. This paradigm is crucial in scenarios where evaluating individual classifier performance or understanding their strengths is challenging due to limited information. We propose a novel deep energy‑based method for constructing an accurate meta‑learner using only the predictions of individual learners, potentially capable of capturing complex dependence structures between them. Our approach requires no labeled data, learner features, or problem‑specific information, and has theoretical guarantees for when learners are conditionally independent. We demonstrate superior performance across diverse ensemble scenarios, including challenging mixture of experts settings. Our experiments span standard ensemble datasets and curated datasets designed to test how the model fuses expertise from multiple sources. These results highlight the potential of unsupervised ensemble learning to harness collective intelligence, especially in data‑scarce or privacy‑sensitive environments.

Authors:Zhiyu Chen, Minhao Liu, Yanru Zhang
Title: TimeCatcher: A Variational Framework for Volatility-Aware Forecasting of Non-Stationary Time Series
Abstract:
Recent lightweight MLP‑based models have achieved strong performance in time series forecasting by capturing stable trends and seasonal patterns. However, their effectiveness hinges on an implicit assumption of local stationarity assumption, making them prone to errors in long‑term forecasting of highly non‑stationary series, especially when abrupt fluctuations occur, a common challenge in domains like web traffic monitoring. To overcome this limitation, we propose TimeCatcher, a novel Volatility‑Aware Variational Forecasting framework. TimeCatcher extends linear architectures with a variational encoder to capture latent dynamic patterns hidden in historical data and a volatility‑aware enhancement mechanism to detect and amplify significant local variations. Experiments on nine real‑world datasets from traffic, financial, energy, and weather domains show that TimeCatcher consistently outperforms state‑of‑the‑art baselines, with particularly large improvements in long‑term forecasting scenarios characterized by high volatility and sudden fluctuations. Our code is available at https://github.com/ColaPrinceCHEN/TimeCatcher.

Authors:Mariia Drozdova
Title: Can Continuous-Time Diffusion Models Generate and Solve Globally Constrained Discrete Problems? A Study on Sudoku
Abstract:
Can standard continuous‑time generative models represent distributions whose support is an extremely sparse, globally constrained discrete set? We study this question using completed Sudoku grids as a controlled testbed, treating them as a subset of a continuous relaxation space. We train flow‑matching and score‑based models along a Gaussian probability path and compare deterministic (ODE) sampling, stochastic (SDE) sampling, and DDPM‑style discretizations derived from the same continuous‑time training. Unconditionally, stochastic sampling substantially outperforms deterministic flows; score‑based samplers are the most reliable among continuous‑time methods, and DDPM‑style ancestral sampling achieves the highest validity overall. We further show that the same models can be repurposed for guided generation: by repeatedly sampling completions under clamped clues and stopping when constraints are satisfied, the model acts as a probabilistic Sudoku solver. Although far less sample‑efficient than classical solvers and discrete‑geometry‑aware diffusion methods, these experiments demonstrate that classic diffusion/flow formulations can assign non‑zero probability mass to globally constrained combinatorial structures and can be used for constraint satisfaction via stochastic search.

Authors:Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim, Seunghyuk Oh, Minghao Yan, Hyung Il Koo, Kangwook Lee
Title: TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs
Abstract:
Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision‑Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario‑specific performance fluctuations. Motivated by these findings, we propose Test‑time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training‑free and keeping ensembling costs negligible through parameter sharing. With its plug‑and‑play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom‑trained models are available at https://github.com/furiosa‑ai/TABED.

Authors:Murad Farzulla
Title: Do Whitepaper Claims Predict Market Behavior? Evidence from Cryptocurrency Factor Analysis
Abstract:
Cryptocurrency projects articulate value propositions through whitepapers, making claims about functionality and technical capabilities. This study investigates whether these narratives align with observed market behavior. We construct a pipeline combining zero‑shot NLP classification (BART‑MNLI) with CP tensor decomposition to compare three spaces: (1) a claims matrix from 24 whitepapers across 10 semantic categories, (2) market statistics for 49 assets over two years of hourly data, and (3) latent factors from tensor decomposition (rank 2, 92.45% variance explained). Using Procrustes rotation and Tucker's congruence coefficient, we test alignment across 23 common entities. Results show weak alignment: claims‑statistics (phi=0.341, p=0.332), claims‑factors (phi=0.077, p=0.747), and statistics‑factors (phi=0.197, p<0.001). The statistics‑factors significance validates our methodology, confirming the pipeline detects relationships when present. Inter‑model validation with DeBERTa‑v3 yields 32% exact agreement but 67% top‑3 agreement. Cross‑sectional analysis reveals heterogeneous contributions: NEAR, MKR, ATOM show positive alignment while ENS, UNI, Bitcoin diverge most. Excluding Bitcoin confirms results are not driven by market dominance. We interpret findings as weak alignment between whitepaper narratives and market factor structure. Limited power (n=23) precludes distinguishing weak from no alignment, but strong alignment (phi>=0.70) can be confidently rejected. Implications for narrative economics and investment analysis are discussed.

Authors:Brian Y. Tsui, Alan Y. Fang, Tiffany J. Hwu
Title: Demonstration-Free Robotic Control via LLM Agents
Abstract:
Robotic manipulation has increasingly adopted vision‑language‑action (VLA) models, which achieve strong performance but typically require task‑specific demonstrations and fine‑tuning, and often generalize poorly under domain shift. We investigate whether general‑purpose large language model (LLM) agent frameworks, originally developed for software engineering, can serve as an alternative control paradigm for embodied manipulation. We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. Using the same iterative reasoning that enables software agents to debug code, FAEA enables embodied agents to reason through manipulation strategies. We evaluate an unmodified frontier agent, Claude Agent SDK, across the LIBERO, ManiSkill3, and MetaWorld benchmarks. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. This level of task success approaches that of VLA models trained with less than 100 demonstrations per task, without requiring demonstrations or fine‑tuning. With one round of human feedback as an optional optimization, performance increases to 88.2% on LIBERO. This demonstration‑free capability has immediate practical value: FAEA can autonomously explore novel scenarios in simulation and generate successful trajectories for training data augmentation in embodied learning. Our results indicate that general‑purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task‑level planning. This opens a path for robotics systems to leverage actively maintained agent infrastructure and benefit directly from ongoing advances in frontier models. Code is available at https://github.com/robiemusketeer/faea‑sim

Authors:Fengrui Zuo, Zhiwei Ke, Yiming Liu, Wenqi Lou, Chao Wang, Xvehai Zhou
Title: Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching
Abstract:
Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full‑sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block‑wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token‑level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix‑localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage‑wise temporal stability, enabling reuse of intermediate representations except for a brief post‑decode transient. Motivated by these observations, we propose \placeholder\footnoteThe source code is available at https://github.com/vhicrgit/Window‑Diffusion., a window‑based token pruning and caching method for inference. We maintain a local computation window that slides rightward as denoising progresses, and partition undecoded tokens into: (i) active tokens that are computed online, (ii) buffer tokens whose KV states are cached and periodically refreshed, and (iii) far‑field tokens that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far‑field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to 99× inference speedup while largely preserving generation performance.

Authors:Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan
Title: Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning
Abstract:
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV‑derived representations are shown to be sufficient for two key applications: (i) Chain‑of‑Embedding, where they achieve competitive or superior performance on Llama‑3.1‑8B‑Instruct and Qwen2‑7B‑Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3‑8B and DeepSeek‑R1‑Distil‑Qwen‑14B, reducing token generation by up to 5.7× with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV‑Embedding.

Authors:Xinyu Li, Sishuo Chen, Guipeng Xv, Li Zhang, Mingxuan Luo, Zhangming Chan, Xiang-Rong Sheng, Han Zhu, Jian Xu, Chen Lin
Title: Delayed Feedback Modeling for Post-Click Gross Merchandise Volume Prediction: Benchmark, Insights and Approaches
Abstract:
The prediction objectives of online advertisement ranking models are evolving from probabilistic metrics like conversion rate (CVR) to numerical business metrics like post‑click gross merchandise volume (GMV). Unlike the well‑studied delayed feedback problem in CVR prediction, delayed feedback modeling for GMV prediction remains unexplored and poses greater challenges, as GMV is a continuous target, and a single click can lead to multiple purchases that cumulatively form the label. To bridge the research gap, we establish TRACE, a GMV prediction benchmark containing complete transaction sequences rising from each user click, which supports delayed feedback modeling in an online streaming manner. Our analysis and exploratory experiments on TRACE reveal two key insights: (1) the rapid evolution of the GMV label distribution necessitates modeling delayed feedback under online streaming training; (2) the label distribution of repurchase samples substantially differs from that of single‑purchase samples, highlighting the need for separate modeling. Motivated by these findings, we propose RepurchasE‑Aware Dual‑branch prEdictoR (READER), a novel GMV modeling paradigm that selectively activates expert parameters according to repurchase predictions produced by a router. Moreover, READER dynamically calibrates the regression target to mitigate under‑estimation caused by incomplete labels. Experimental results show that READER yields superior performance on TRACE over baselines, achieving a 2.19% improvement in terms of accuracy. We believe that our study will open up a new avenue for studying online delayed feedback modeling for GMV prediction, and our TRACE benchmark with the gathered insights will facilitate future research and application in this promising direction. Our code and dataset are available at https://github.com/alimama‑tech/OnlineGMV .

Authors:Jie Tang, Chuanlong Xie, Xianli Zeng, Lixing Zhu
Title: Empirical Likelihood-Based Fairness Auditing: Distribution-Free Certification and Flagging
Abstract:
Machine learning models in high‑stakes applications, such as recidivism prediction and automated personnel selection, often exhibit systematic performance disparities across sensitive subpopulations, raising critical concerns regarding algorithmic bias. Fairness auditing addresses these risks through two primary functions: certification, which verifies adherence to fairness constraints; and flagging, which isolates specific demographic groups experiencing disparate treatment. However, existing auditing techniques are frequently limited by restrictive distributional assumptions or prohibitive computational overhead. We propose a novel empirical likelihood‑based (EL) framework that constructs robust statistical measures for model performance disparities. Unlike traditional methods, our approach is non‑parametric; the proposed disparity statistics follow asymptotically chi‑square or mixed chi‑square distributions, ensuring valid inference without assuming underlying data distributions. This framework uses a constrained optimization profile that admits stable numerical solutions, facilitating both large‑scale certification and efficient subpopulation discovery. Empirically, the EL methods outperform bootstrap‑based approaches, yielding coverage rates closer to nominal levels while reducing computational latency by several orders of magnitude. We demonstrate the practical utility of this framework on the COMPAS dataset, where it successfully flags intersectional biases, specifically identifying a significantly higher positive prediction rate for African‑American males under 25 and a systemic under‑prediction for Caucasian females relative to the population mean.

Authors:Jinren Ding, Xuejian Xu, Shen Jiang, Zhitong Hao, Jinhui Yang, Peng Jiang
Title: C2:Cross learning module enhanced decision transformer with Constraint-aware loss for auto-bidding
Abstract:
Decision Transformer (DT) shows promise for generative auto‑bidding by capturing temporal dependencies, but suffers from two critical limitations: insufficient cross‑correlation modeling among state, action, and return‑to‑go (RTG) sequences, and indiscriminate learning of optimal/suboptimal behaviors. To address these, we propose C2, a novel framework enhancing DT with two core innovations: (1) a Cross Learning Block (CLB) via cross‑attention to strengthen inter‑sequence correlation modeling; (2) a Constraint‑aware Loss (CL) incorporating budget and Cost‑Per‑Acquisition (CPA) constraints for selective learning of optimal trajectories. Extensive offline evaluations on the AuctionNet dataset demonstrate consistent performance gains (up to 3.2% over state‑of‑the‑art method) across diverse budget settings; ablation studies verify the complementary synergy of CLB and CL, confirming C2's superiority in auto‑bidding. The code for reproducing our results is available at: https://github.com/Dingjinren/C2.

Authors:Jim Maar, Denis Paperno, Callum Stuart McDougall, Neel Nanda
Title: What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering
Abstract:
Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation to a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross‑layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. "‑ight") or answer to a question ("whale") can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.

Authors:Richard Csaky
Title: Scaling Next-Brain-Token Prediction for MEG
Abstract:
We present a large autoregressive model for source‑space MEG that scales next‑token prediction to long context across datasets and scanners: handling a corpus of over 500 hours and thousands of sessions across the three largest MEG datasets. A modified SEANet‑style vector‑quantizer reduces multichannel MEG into a flattened token stream on which we train a Qwen2.5‑VL backbone from scratch to predict the next brain token and to recursively generate minutes of MEG from up to a minute of context. To evaluate long‑horizon generation, we introduce task‑matched tests: (i) on‑manifold stability via generated‑only drift compared to the time‑resolved distribution of real sliding windows, and (ii) conditional specificity via correct context versus prompt‑swap controls using a neurophysiologically grounded metric set. We train on CamCAN and Omega and run all analyses on held‑out MOUS, establishing cross‑dataset generalization. Across metrics, generations remain relatively stable over long rollouts and are closer to the correct continuation than swapped controls. Code available at: https://github.com/ricsinaruto/brain‑gen.

Authors:Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot
Title: Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models
Abstract:
Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine‑tune and evaluate Granite‑3.3‑2B‑Instruct and Qwen‑3‑4B‑Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (‑1, r_abs, 1) under varying abstention reward structures. We further study the effect of combining RLVR with supervised fine‑tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs \approx ‑0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple‑choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open‑ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available here https://github.com/Mystic‑Slice/rl‑abstention.

Authors:Atik Faysal, Mohammad Rostami, Reihaneh Gh. Roshan, Nikhil Muralidhar, Huaxia Wang
Title: Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data
Abstract:
We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose Semi‑Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled samples with dynamically selected pseudo‑labels. SSMAE introduces a validation‑driven gating mechanism that activates pseudo‑labeling only after the model achieves reliable, high‑confidence predictions that are consistent across both weakly and strongly augmented views of the same image, reducing confirmation bias. On CIFAR‑10 and CIFAR‑100, SSMAE consistently outperforms supervised ViT and fine‑tuned MAE, with the largest gains in low‑label regimes (+9.24% over ViT on CIFAR‑10 with 10% labels). Our results demonstrate that when pseudo‑labels are introduced is as important as how they are generated for data‑efficient transformer training. Codes are available at https://github.com/atik666/ssmae.

Authors:Fang Li
Title: Structural Compositional Function Networks: Interpretable Functional Compositions for Tabular Discovery
Abstract:
Despite the ubiquity of tabular data in high‑stakes domains, traditional deep learning architectures often struggle to match the performance of gradient‑boosted decision trees while maintaining scientific interpretability. Standard neural networks typically treat features as independent entities, failing to exploit the inherent manifold structural dependencies that define tabular distributions. We propose Structural Compositional Function Networks (StructuralCFN), a novel architecture that imposes a Relation‑Aware Inductive Bias via a differentiable structural prior. StructuralCFN explicitly models each feature as a mathematical composition of its counterparts through Differentiable Adaptive Gating, which automatically discovers the optimal activation physics (e.g., attention‑style filtering vs. inhibitory polarity) for each relationship. Our framework enables Structured Knowledge Integration, allowing domain‑specific relational priors to be injected directly into the architecture to guide discovery. We evaluate StructuralCFN across a rigorous 10‑fold cross‑validation suite on 18 benchmarks, demonstrating statistically significant improvements (p < 0.05) on scientific and clinical datasets (e.g., Blood Transfusion, Ozone, WDBC). Furthermore, StructuralCFN provides Intrinsic Symbolic Interpretability: it recovers the governing "laws" of the data manifold as human‑readable mathematical expressions while maintaining a compact parameter footprint (300‑‑2,500 parameters) that is over an order of magnitude (10x‑‑20x) smaller than standard deep baselines.

Authors:Haoyuan Deng, Yuanjiang Xue, Haoyang Du, Boyang Zhou, Zhenyu Wu, Ziwei Wang
Title: E2HiL: Entropy-Guided Sample Selection for Efficient Real-World Human-in-the-Loop Reinforcement Learning
Abstract:
Human‑in‑the‑loop guidance has emerged as an effective approach for enabling faster convergence in online reinforcement learning (RL) of complex real‑world manipulation tasks. However, existing human‑in‑the‑loop RL (HiL‑RL) frameworks often suffer from low sample efficiency, requiring substantial human interventions to achieve convergence and thereby leading to high labor costs. To address this, we propose a sample‑efficient real‑world human‑in‑the‑loop RL framework named \method, which requires fewer human intervention by actively selecting informative samples. Specifically, stable reduction of policy entropy enables improved trade‑off between exploration and exploitation with higher sample efficiency. We first build influence functions of different samples on the policy entropy, which is efficiently estimated by the covariance of action probabilities and soft advantages of policies. Then we select samples with moderate values of influence functions, where shortcut samples that induce sharp entropy drops and noisy samples with negligible effect are pruned. Extensive experiments on four real‑world manipulation tasks demonstrate that \method achieves a 42.1% higher success rate while requiring 10.1% fewer human interventions compared to the state‑of‑the‑art HiL‑RL method, validating its effectiveness. The project page providing code, videos, and mathematical formulations can be found at https://e2hil.github.io/.

Authors:Mingxuan Luo, Guipeng Xv, Sishuo Chen, Xinyu Li, Li Zhang, Zhangming Chan, Xiang-Rong Sheng, Han Zhu, Jian Xu, Bo Zheng, Chen Lin
Title: Modeling Cascaded Delay Feedback for Online Net Conversion Rate Prediction: Benchmark, Insights and Solutions
Abstract:
In industrial recommender systems, conversion rate (CVR) is widely used for traffic allocation, but it fails to fully reflect recommendation effectiveness because it ignores refund behavior. To better capture true user satisfaction and business value, net conversion rate (NetCVR), defined as the probability that a clicked item is purchased and not refunded, has been proposed.Unlike CVR, NetCVR prediction involves a more complex multi‑stage cascaded delayed feedback process. The two cascaded delays from click to conversion and from conversion to refund have opposite effects, making traditional CVR modeling methods inapplicable. Moreover, the lack of open‑source datasets and online continuous training schemes further hinders progress in this area.To address these challenges, we introduce CASCADE (Cascaded Sequences of Conversion and Delayed Refund), the first large‑scale open dataset derived from the Taobao app for online continuous NetCVR prediction. Through an in‑depth analysis of CASCADE, we identify three key insights: (1) NetCVR exhibits strong temporal dynamics, necessitating online continuous modeling; (2) cascaded modeling of CVR and refund rate outperforms direct NetCVR modeling; and (3) delay time, which correlates with both CVR and refund rate, is an important feature for NetCVR prediction.Based on these insights, we propose TESLA, a continuous NetCVR modeling framework featuring a CVR‑refund‑rate cascaded architecture, stage‑wise debiasing, and a delay‑time‑aware ranking loss. Extensive experiments demonstrate that TESLA consistently outperforms state‑of‑the‑art methods on CASCADE, achieving absolute improvements of 12.41 percent in RI‑AUC and 14.94 percent in RI‑PRAUC on NetCVR prediction. The code and dataset are publicly available at https://github.com/alimama‑tech/NetCVR.

Authors:Nikhil Raghav, Avisek Gupta, Swagatam Das, Md Sahidullah
Title: MK-SGC-SC: Multiple Kernel Guided Sparse Graph Construction in Spectral Clustering for Unsupervised Speaker Diarization
Abstract:
Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers. Although unsupervised speaker diarization is inherently challenging, the prospect of identifying speaker regions without pretraining or weak supervision motivates research on clustering techniques. In this work, we share the notable observation that measuring multiple kernel similarities of speaker embeddings to thereafter craft a sparse graph for spectral clustering in a principled manner is sufficient to achieve state‑of‑the‑art performances in a fully unsupervised setting. Specifically, we consider four polynomial kernels and a degree one arccosine kernel to measure similarities in speaker embeddings, using which sparse graphs are constructed in a principled manner to emphasize local similarities. Experiments show the proposed approach excels in unsupervised speaker diarization over a variety of challenging environments in the DIHARD‑III, AMI, and VoxConverse corpora. To encourage further research, our implementations are available at https://github.com/nikhilraghav29/MK‑SGC‑SC.

Authors:Yuhao Li
Title: Emergent Specialization in Learner Populations: Competition as the Source of Diversity
Abstract:
How can populations of learners develop coordinated, diverse behaviors without explicit communication or diversity incentives? We demonstrate that competition alone is sufficient to induce emergent specialization ‑‑ learners spontaneously partition into specialists for different environmental regimes through competitive dynamics, consistent with ecological niche theory. We introduce the NichePopulation algorithm, a simple mechanism combining competitive exclusion with niche affinity tracking. Validated across six real‑world domains (cryptocurrency trading, commodity prices, weather forecasting, solar irradiance, urban traffic, and air quality), our approach achieves a mean Specialization Index of 0.75 with effect sizes of Cohen's d > 20. Key findings: (1) At lambda=0 (no niche bonus), learners still achieve SI > 0.30, proving specialization is genuinely emergent; (2) Diverse populations outperform homogeneous baselines by +26.5% through method‑level division of labor; (3) Our approach outperforms MARL baselines (QMIX, MAPPO, IQL) by 4.3x while being 4x faster.

Authors:Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge
Title: OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
Abstract:
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real‑world tasks. To bridge this gap, we propose OPT‑ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT‑ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed‑Integer Programming. Utilizing OPT‑ENGINE, we conduct an extensive study of LLMs' reasoning capabilities, addressing two critical questions: 1.) Do LLMs' performance remain robust when generalizing to out‑of‑distribution optimization tasks that scale in complexity beyond current benchmark levels? and 2.) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool‑integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure‑text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next‑generation LLMs for advanced optimization. Our code is publicly available at \textcolorbluehttps://github.com/Cardinal‑Operations/OPTEngine.

Authors:Mao-Lin Luo, Zi-Hao Zhou, Yi-Lin Zhang, Yuanyu Wan, Tong Wei, Min-Ling Zhang
Title: KeepLoRA: Continual Learning with Residual Gradient Adaptation
Abstract:
Continual learning for pre‑trained vision‑language models requires balancing three competing objectives: retaining pre‑trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents a simple but effective approach called KeepLoRA to effectively balance these objectives. We first analyze the knowledge retention mechanism within the model parameter space and find that general knowledge is mainly encoded in the principal subspace, while task‑specific knowledge is encoded in the residual subspace. Motivated by this finding, KeepLoRA learns new tasks by restricting LoRA parameter updates in the residual subspace to prevent interfering with previously learned capabilities. Specifically, we infuse knowledge for a new task by projecting its gradient onto a subspace orthogonal to both the principal subspace of pre‑trained model and the dominant directions of previous task features. Our theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state‑of‑the‑art performance. The implementation code is available at https://github.com/MaolinLuo/KeepLoRA.

Authors:Yongqi Wang, Xiaofeng Ji, Jie Wang, Qingbin Li, Xiao Xiong, Zheming Yang, Jian Xu, Minghui Qiu, Xinxiao Wu
Title: From Atoms to Chains: Divergence-Guided Reasoning Curriculum for Unlabeled LLM Domain Adaptation
Abstract:
Adapting Large Language Models (LLMs) to specialized domains without human‑annotated data is a crucial yet formidable challenge. Widely adopted knowledge distillation methods often devolve into coarse‑grained mimicry, where the student model inefficiently targets its own weaknesses and risks inheriting the teacher's reasoning flaws. This exposes a critical pedagogical dilemma: how to devise a reliable curriculum when the teacher itself is not an infallible expert. Our work resolves this by capitalizing on a key insight: while LLMs may exhibit fallibility in complex, holistic reasoning, they often exhibit high fidelity on focused, atomic sub‑problems. Based on this, we propose Divergence‑Guided Reasoning Curriculum (DGRC), which constructs a learning path from atomic knowledge to reasoning chains by dynamically deriving two complementary curricula from disagreements in reasoning pathways. When a student and teacher produce conflicting results, DGRC directs the teacher to perform a diagnostic analysis: it analyzes both reasoning paths to formulate atomic queries that target the specific points of divergence, and then self‑answers these queries to create high‑confidence atomic question‑answer pairs. These pairs then serve a dual purpose: (1) providing an atomic curriculum to rectify the student's knowledge gaps, and (2) serving as factual criteria to filter the teacher's original reasoning chains, yielding a verified CoT curriculum that teaches the student how to integrate atomic knowledge into complete reasoning paths. Experiments across the medical and legal domains on student models of various sizes demonstrate the effectiveness of our DGRC framework. Notably, our method achieves a 7.76% relative improvement for the 1.5B student model in the medical domain over strong unlabeled baseline.

Authors:Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Wenhai Wang
Title: LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment
Abstract:
Safety‑aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over‑refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade‑off ‑‑ reducing jailbreak increases over‑refusal and vice versa. We identify the root cause: LLMs encode the decision to answer (answer vector v_a) and the judgment of input safety (benign vector v_b) as nearly orthogonal directions, treating them as independent processes. We propose LLM‑VA, which aligns v_a with v_b through closed‑form weight updates, making the model's willingness to answer causally dependent on its safety assessment ‑‑ without fine‑tuning or architectural changes. Our method identifies vectors at each layer using SVMs, selects safety‑relevant layers, and iteratively aligns vectors via minimum‑norm weight modifications. Experiments on 12 LLMs demonstrate that LLM‑VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model's safety bias without manual tuning. Code and models are available at https://hotbento.github.io/LLM‑VA‑Web/.

Authors:Hongzhu Yi, Xinming Wang, Zhenghao zhang, Tianyu Zong, Yuanxiang Wang, Jun Xie, Tao Yu, Haopeng Jin, Zhepeng Wang, Kaixin Xu, Feng Chen, Jiahuan Chen, Yujia Yang, Zhenyu Guan, Bingkang Shi, Jungang Xu
Title: RPO:Reinforcement Fine-Tuning with Partial Reasoning Optimization
Abstract:
Within the domain of large language models, reinforcement fine‑tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine‑Tuning with Partial Reasoning Optimization (RPO), a plug‑and‑play reinforcement fine‑tuning algorithm. Unlike traditional reinforcement fine‑tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using experience cache. During the rollout phase of training, RPO reduces token generation in this phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full‑path reinforcement fine‑tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and the 7B model by 72%. At the same time, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open‑sourced at https://github.com/yhz5613813/RPO.

Authors:Viacheslav Sydora, Guner Dilsad Er, Michael Muehlebach
Title: Teaching Machine Learning Fundamentals with LEGO Robotics
Abstract:
This paper presents the web‑based platform Machine Learning with Bricks and an accompanying two‑day course designed to teach machine learning concepts to students aged 12 to 17 through programming‑free robotics activities. Machine Learning with Bricks is an open source platform and combines interactive visualizations with LEGO robotics to teach three core algorithms: KNN, linear regression, and Q‑learning. Students learn by collecting data, training models, and interacting with robots via a web‑based interface. Pre‑ and post‑surveys with 14 students demonstrate significant improvements in conceptual understanding of machine learning algorithms, positive shifts in AI perception, high platform usability, and increased motivation for continued learning. This work demonstrates that tangible, visualization‑based approaches can make machine learning concepts accessible and engaging for young learners while maintaining technical depth. The platform is freely available at https://learning‑and‑dynamics.github.io/ml‑with‑bricks/, with video tutorials guiding students through the experiments at https://youtube.com/playlist?list=PLx1grFu4zAcwfKKJZ1Ux4LwRqaePCOA2J.

Authors:Quy-Anh Dang, Chris Ngo
Title: Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection
Abstract:
Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference‑time intervention approach, but existing methods suffer from critical limitations: activation addition requires careful coefficient tuning and is sensitive to layer‑specific norm variations, while directional ablation provides only binary control. Recent work on Angular Steering introduces continuous control via rotation in a 2D subspace, but its practical implementation violates norm preservation, causing distribution shift and generation collapse, particularly in models below 7B parameters. We propose Selective Steering, which addresses these limitations through two key innovations: (1) a mathematically rigorous norm‑preserving rotation formulation that maintains activation distribution integrity, and (2) discriminative layer selection that applies steering only where feature representations exhibit opposite‑signed class alignment. Experiments across nine models demonstrate that Selective Steering achieves 5.5x higher attack success rates than prior methods while maintaining zero perplexity violations and approximately 100% capability retention on standard benchmarks. Our approach provides a principled, efficient framework for controllable and stable LLM behavior modification. Code: https://github.com/knoveleng/steering

Authors:Zhao-Han Peng, Shaohui Li, Zhi Li, Shulan Ruan, Yu Liu, You He
Title: From Observations to Events: Event-Aware World Model for Reinforcement Learning
Abstract:
While model‑based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes and remain vulnerable to spurious variations such as textures or color shifts. From a cognitive science perspective, humans segment continuous sensory streams into discrete events and rely on these key events for decision‑making. Motivated by this principle, we propose the Event‑Aware World Model (EAWM), a general framework that learns event‑aware representations to streamline policy learning without requiring handcrafted labels. EAWM employs an automated event generator to derive events from raw observations and introduces a Generic Event Segmentor (GES) to identify event boundaries, which mark the start and end time of event segments. Through event prediction, the representation space is shaped to capture meaningful spatio‑temporal transitions. Beyond this, we present a unified formulation of seemingly distinct world model architectures and show the broad applicability of our methods. Experiments on Atari 100K, Craftax 1M, and DeepMind Control 500K, DMC‑GB2 500K demonstrate that EAWM consistently boosts the performance of strong MBRL baselines by 10%‑45%, setting new state‑of‑the‑art results across benchmarks. Our code is released at https://github.com/MarquisDarwin/EAWM.

Authors:Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron
Title: StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths
Abstract:
Quantization‑aware training (QAT) is essential for deploying large models under strict memory and latency constraints, yet achieving stable and robust optimization at ultra‑low bitwidths remains challenging. Common approaches based on the straight‑through estimator (STE) or soft quantizers often suffer from gradient mismatch, instability, or high computational overhead. As such, we propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low‑bit settings via a novel, lightweight, and theoretically grounded surrogate for backpropagation derived from a discrete Fourier analysis of the rounding operator. StableQAT strictly generalizes STE as the latter arises as a special case of our more expressive surrogate family, yielding smooth, bounded, and inexpensive gradients that improve QAT training performance and stability across various hyperparameter choices. In experiments, StableQAT exhibits stable and efficient QAT at 2‑4 bit regimes, demonstrating improved training stability, robustness, and superior performance with negligible training overhead against standard QAT techniques. Our code is available at https://github.com/microsoft/StableQAT.

Authors:Alexandre Alouadi, Pierre Henry-Labordère, Grégoire Loeper, Othmane Mazhar, Huyên Pham, Nizar Touzi
Title: LightSBB-M: Bridging Schrödinger and Bass for Generative Diffusion Modeling
Abstract:
The Schrodinger Bridge and Bass (SBB) formulation, which jointly controls drift and volatility, is an established extension of the classical Schrodinger Bridge (SB). Building on this framework, we introduce LightSBB‑M, an algorithm that computes the optimal SBB transport plan in only a few iterations. The method exploits a dual representation of the SBB objective to obtain analytic expressions for the optimal drift and volatility, and it incorporates a tunable parameter beta greater than zero that interpolates between pure drift (the Schrodinger Bridge) and pure volatility (Bass martingale transport). We show that LightSBB‑M achieves the lowest 2‑Wasserstein distance on synthetic datasets against state‑of‑the‑art SB and diffusion baselines with up to 32 percent improvement. We also illustrate the generative capability of the framework on an unpaired image‑to‑image translation task (adult to child faces in FFHQ). These findings demonstrate that LightSBB‑M provides a scalable, high‑fidelity SBB solver that outperforms existing SB and diffusion baselines across both synthetic and real‑world generative tasks. The code is available at https://github.com/alexouadi/LightSBB‑M.

Authors:Shengjia Zhang, Weiqin Yang, Jiawei Chen, Peng Wu, Yuegang Sun, Gang Wang, Qihao Shi, Can Wang
Title: Talos: Optimizing Top-$K$ Accuracy in Recommender Systems
Abstract:
Recommender systems (RS) aim to retrieve a small set of items that best match individual user preferences. Naturally, RS place primary emphasis on the quality of the Top‑K results rather than performance across the entire item set. However, estimating Top‑K accuracy (e.g., Precision@K, Recall@K) requires determining the ranking positions of items, which imposes substantial computational overhead and poses significant challenges for optimization. In addition, RS often suffer from distribution shifts due to evolving user preferences or data biases, further complicating the task. To address these issues, we propose Talos, a loss function that is specifically designed to optimize the Talos recommendation accuracy. Talos leverages a quantile technique that replaces the complex ranking‑dependent operations into simpler comparisons between predicted scores and learned score thresholds. We further develop a sampling‑based regression algorithm for efficient and accurate threshold estimation, and introduce a constraint term to maintain optimization stability by preventing score inflation. Additionally, we incorporate a tailored surrogate function to address discontinuity and enhance robustness against distribution shifts. Comprehensive theoretical analyzes and empirical experiments are conducted to demonstrate the effectiveness, efficiency, convergence, and distributional robustness of Talos. The code is available at https://github.com/cynthia‑shengjia/WWW‑2026‑Talos.

Authors:Qi Si, Xuyang Liu, Penglei Wang, Xin Guo, Yuan Qi, Yuan Cheng
Title: Structure-based RNA Design by Step-wise Optimization of Latent Diffusion Model
Abstract:
RNA inverse folding, designing sequences to form specific 3D structures, is critical for therapeutics, gene regulation, and synthetic biology. Current methods, focused on sequence recovery, struggle to address structural objectives like secondary structure consistency (SS), minimum free energy (MFE), and local distance difference test (LDDT), leading to suboptimal structural accuracy. To tackle this, we propose a reinforcement learning (RL) framework integrated with a latent diffusion model (LDM). Drawing inspiration from the success of diffusion models in RNA inverse folding, which adeptly model complex sequence‑structure interactions, we develop an LDM incorporating pre‑trained RNA‑FM embeddings from a large‑scale RNA model. These embeddings capture co‑evolutionary patterns, markedly improving sequence recovery accuracy. However, existing approaches, including diffusion‑based methods, cannot effectively handle non‑differentiable structural objectives. By contrast, RL excels in this task by using policy‑driven reward optimization to navigate complex, non‑gradient‑based objectives, offering a significant advantage over traditional methods. In summary, we propose the Step‑wise Optimization of Latent Diffusion Model (SOLD), a novel RL framework that optimizes single‑step noise without sampling the full diffusion trajectory, achieving efficient refinement of multiple structural objectives. Experimental results demonstrate SOLD surpasses its LDM baseline and state‑of‑the‑art methods across all metrics, establishing a robust framework for RNA inverse folding with profound implications for biotechnological and therapeutic applications.

Authors:Chaozheng Wen, Jingwen Tong, Zehong Lin, Chenghong Bian, Jun Zhang
Title: Bridging Visual and Wireless Sensing: A Unified Radiation Field for 3D Radio Map Construction
Abstract:
The emerging applications of next‑generation wireless networks (e.g., immersive 3D communication, low‑altitude networks, and integrated sensing and communication) necessitate high‑fidelity environmental intelligence. 3D radio maps have emerged as a critical tool for this purpose, enabling spectrum‑aware planning and environment‑aware sensing by bridging the gap between physical environments and electromagnetic signal propagation. However, constructing accurate 3D radio maps requires fine‑grained 3D geometric information and a profound understanding of electromagnetic wave propagation. Existing approaches typically treat optical and wireless knowledge as distinct modalities, failing to exploit the fundamental physical principles governing both light and electromagnetic propagation. To bridge this gap, we propose URF‑GS, a unified radio‑optical radiation field representation framework for accurate and generalizable 3D radio map construction based on 3D Gaussian splatting (3D‑GS) and inverse rendering. By fusing visual and wireless sensing observations, URF‑GS recovers scene geometry and material properties while accurately predicting radio signal behavior at arbitrary transmitter‑receiver (Tx‑Rx) configurations. Experimental results demonstrate that URF‑GS achieves up to a 24.7% improvement in spatial spectrum prediction accuracy and a 10x increase in sample efficiency for 3D radio map construction compared with neural radiance field (NeRF)‑based methods. This work establishes a foundation for next‑generation wireless networks by integrating perception, interaction, and communication through holistic radiation field reconstruction.

Authors:Benjamin Turtel, Paul Wilczewski, Danny Franklin, Kris Skotheim
Title: Foresight Learning for SEC Risk Prediction
Abstract:
Risk disclosures in SEC filings describe potential adverse events but rarely quantify their likelihood, limiting their usefulness for probabilistic analysis. A central obstacle is the absence of large‑scale, risk‑level supervision linking disclosed risks to realized outcomes. We introduce a fully automated data generation pipeline that converts qualitative SEC risk disclosures into temporally grounded supervision using only public data. For each filing, the pipeline generates firm‑specific, time‑bounded risk queries from the Risk Factors section and labels them by automatically resolving outcomes against subsequent disclosures. Using this dataset of risk queries and outcomes grounded in SEC filings, we train a compact large language model to estimate the probability that a disclosed risk will materialize within a specified horizon. Despite its modest size, the resulting model substantially improves over pretrained and heuristic baselines, and outperforms frontier general‑purpose models, including GPT‑5, on probabilistic accuracy and calibration. More broadly, this work demonstrates that Foresight Learning enables scalable and fully automated training of domain‑specific expert models using only raw, chronological, in‑domain text ‑‑ without proprietary data, external corpora, or manual annotation. The resulting models achieve frontier‑level performance while remaining deployable on a single GPU. This result suggests a general pathway for learning calibrated, decision‑relevant signals from naturally occurring enterprise documents. To support transparency and reproducibility, we open‑source the evaluation dataset used in this study. Evaluation Data: https://huggingface.co/datasets/LightningRodLabs/sec_risk_questions_test_set Data Generation Platform: https://lightningrod.ai/ SDK: https://github.com/lightning‑rod‑labs/lightningrod‑python‑sdk

Authors:Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
Title: C2NP: A Benchmark for Learning Scale-Dependent Geometric Invariances in 3D Materials Generation
Abstract:
Generative models for materials have achieved strong performance on periodic bulk crystals, yet their ability to generalize across scale transitions to finite nanostructures remains largely untested. We introduce Crystal‑to‑Nanoparticle (C2NP), a systematic benchmark for evaluating generative models when moving between infinite crystalline unit cells and finite nanoparticles, where surface effects and size‑dependent distortions dominate. C2NP defines two complementary tasks: (i) generating nanoparticles of specified radii from periodic unit cells, testing whether models capture surface truncation and geometric constraints; and (ii) recovering bulk lattice parameters and space‑group symmetry from finite particle configurations, assessing whether models can infer underlying crystallographic order despite surface perturbations. Using diverse materials as a structurally consistent testbed, we construct over 170,000 nanoparticle configurations by carving particles from supercells derived from DFT‑relaxed crystal unit cells, and introduce size‑based splits that separate interpolation from extrapolation regimes. Experiments with state‑of‑the‑art approaches, including diffusion, flow‑matching, and variational models, show that even when losses are low, models often fail geometrically under distribution shift, yielding large lattice‑recovery errors and near‑zero joint accuracy on structure and symmetry. Overall, our results suggest that current methods rely on template memorization rather than scalable physical generalization. C2NP offers a controlled, reproducible framework for diagnosing these failures, with immediate applications to nanoparticle catalyst design, nanostructured hydrides for hydrogen storage, and materials discovery. Dataset and code are available at https://github.com/KurbanIntelligenceLab/C2NP.

Authors:Junwei Deng, Chang Xu, Jiaqi W. Ma, Ming Jin, Chenghao Liu, Jiang Bian
Title: OATS: Online Data Augmentation for Time Series Foundation Models
Abstract:
Time Series Foundation Models (TSFMs) are a powerful paradigm for time series analysis and are often enhanced by synthetic data augmentation to improve the training data quality. Existing augmentation methods, however, typically rely on heuristics and static paradigms. Motivated by dynamic data optimization, which shows that the contribution of samples varies across training stages, we propose OATS (Online Data Augmentation for Time Series Foundation Models), a principled strategy that generates synthetic data tailored to different training steps. OATS leverages valuable training samples as principled guiding signals and dynamically generates high‑quality synthetic data conditioned on them. We further design a diffusion‑based framework to produce realistic time series and introduce an explore‑exploit mechanism to balance efficiency and effectiveness. Experiments on TSFMs demonstrate that OATS consistently outperforms regular training and yields substantial performance gains over static data augmentation baselines across six validation datasets and two TSFM architectures. The code is available at the link https://github.com/microsoft/TimeCraft.

Authors:Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar
Title: FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Abstract:
We propose FROST, an attention‑aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of reasoning outliers and design an attention‑based mechanism to remove them. Theoretically, FROST preserves and enhances the model's reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi‑4‑Reasoning and GPT‑OSS‑20B), outperforming state‑of‑the‑art methods such as TALE and ThinkLess. Notably, FROST achieves an average 69.68% reduction in token usage and a 26.70% improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm by 15.97% and the average kurtosis by 91.09% compared to the base model. Code is available at https://github.com/robinzixuan/FROST

Authors:Fangzhou Wu, Sandeep Silwal, Qiuyi, Zhang
Title: Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective
Abstract:
KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key‑value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi‑LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade‑offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning‑based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix‑sharing settings, demonstrating improvements of up to 6.92× in cache hit rate, 11.96× reduction in latency, 14.06× reduction in time‑to‑first‑token (TTFT), and 77.4% increase in throughput over the state‑of‑the‑art methods. Our code is available at https://github.com/fzwark/KVRouting.

Authors:Yaohua Zha, Chunlin Fan, Peiyuan Liu, Yong Jiang, Tao Dai, Hai Wu, Shu-Tao Xia
Title: CP Loss: Channel-wise Perceptual Loss for Time Series Forecasting
Abstract:
Multi‑channel time‑series data, prevalent across diverse applications, is characterized by significant heterogeneity in its different channels. However, existing forecasting models are typically guided by channel‑agnostic loss functions like MSE, which apply a uniform metric across all channels. This often leads to fail to capture channel‑specific dynamics such as sharp fluctuations or trend shifts. To address this, we propose a Channel‑wise Perceptual Loss (CP Loss). Its core idea is to learn a unique perceptual space for each channel that is adapted to its characteristics, and to compute the loss within this space. Specifically, we first design a learnable channel‑wise filter that decomposes the raw signal into disentangled multi‑scale representations, which form the basis of our perceptual space. Crucially, the filter is optimized jointly with the main forecasting model, ensuring that the learned perceptual space is explicitly oriented towards the prediction task. Finally, losses are calculated within these perception spaces to optimize the model. Code is available at https://github.com/zyh16143998882/CP_Loss.

Authors:Vincent Gurgul, Ying Chen, Stefan Lessmann
Title: Variational Quantum Circuit-Based Reinforcement Learning for Dynamic Portfolio Optimization
Abstract:
This paper presents a Quantum Reinforcement Learning (QRL) solution to the dynamic portfolio optimization problem based on Variational Quantum Circuits. The implemented QRL approaches are quantum analogues of the classical neural‑network‑based Deep Deterministic Policy Gradient and Deep Q‑Network algorithms. Through an empirical evaluation on real‑world financial data, we show that our quantum agents achieve risk‑adjusted performance comparable to, and in some cases exceeding, that of classical Deep RL models with several orders of magnitude more parameters. However, while quantum circuit execution is inherently fast at the hardware level, practical deployment on cloud‑based quantum systems introduces substantial latency, making end‑to‑end runtime currently dominated by infrastructural overhead and limiting practical applicability. Taken together, our results suggest that QRL is theoretically competitive with state‑of‑the‑art classical reinforcement learning and may become practically advantageous as deployment overheads diminish. This positions QRL as a promising paradigm for dynamic decision‑making in complex, high‑dimensional, and non‑stationary environments such as financial markets. The complete codebase is released as open source at: https://github.com/VincentGurgul/qrl‑dpo‑public

Authors:Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe
Title: Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
Abstract:
Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: A self‑improvement framework designed to surface these pedagogical signals through meta‑RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded with its improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi‑level meta‑RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self‑play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well‑posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.

Authors:Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, Tianyi Zhou
Title: TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
Abstract:
Time series are ubiquitous in real‑world scenarios and crucial for applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve complex problems. However, current benchmarks for generalist models largely overlook this dimension. To bridge this gap, we introduce TSRBench, a comprehensive multi‑modal benchmark designed to stress‑test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, and is categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision‑Making. ii) 15 tasks from the 4 dimensions evaluating essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluate over 30 leading proprietary and open‑source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context‑aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual forms of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.

Authors:Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover
Title: Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Abstract:
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On‑policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token‑level supervision, addressing the distribution mismatch between training and inference in off‑policy distillation methods. However, on‑policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground‑truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On‑Policy Self‑Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per‑token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8‑12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off‑policy distillation methods.

Authors:Zhiwei Zheng, Kevin Bryson
Title: LaCoGSEA: Unsupervised deep learning for pathway analysis via latent correlation
Abstract:
Motivation: Pathway enrichment analysis is widely used to interpret gene expression data. Standard approaches, such as GSEA, rely on predefined phenotypic labels and pairwise comparisons, which limits their applicability in unsupervised settings. Existing unsupervised extensions, including single‑sample methods, provide pathway‑level summaries but primarily capture linear relationships and do not explicitly model gene‑pathway associations. More recently, deep learning models have been explored to capture non‑linear transcriptomic structure. However, their interpretation has typically relied on generic explainable AI (XAI) techniques designed for feature‑level attribution. As these methods are not designed for pathway‑level interpretation in unsupervised transcriptomic analyses, their effectiveness in this setting remains limited. Results: To bridge this gap, we introduce LaCoGSEA (Latent Correlation GSEA), an unsupervised framework that integrates deep representation learning with robust pathway statistics. LaCoGSEA employs an autoencoder to capture non‑linear manifolds and proposes a global gene‑latent correlation metric as a proxy for differential expression, generating dense gene rankings without prior labels. We demonstrate that LaCoGSEA offers three key advantages: (i) it achieves improved clustering performance in distinguishing cancer subtypes compared to existing unsupervised baselines; (ii) it recovers a broader range of biologically meaningful pathways at higher ranks compared with linear dimensionality reduction and gradient‑based XAI methods; and (iii) it maintains high robustness and consistency across varying experimental protocols and dataset sizes. Overall, LaCoGSEA provides state‑of‑the‑art performance in unsupervised pathway enrichment analysis. Availability and implementation: https://github.com/willyzzz/LaCoGSEA

Authors:Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, Sung Ju Hwang
Title: Self-Refining Video Sampling
Abstract:
Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine‑grained motion. In this work, we present self‑refining video sampling, a simple method that uses a pre‑trained video generator trained on large‑scale datasets as its own self‑refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner‑loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty‑aware refinement strategy that selectively refines regions based on self‑consistency, which prevents artifacts caused by over‑refinement. Experiments on state‑of‑the‑art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance‑based sampler.

Authors:Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi
Title: Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
Abstract:
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just‑In‑Time Reinforcement Learning (JitRL), a training‑free framework that enables test‑time policy optimization without any gradient updates. JitRL maintains a dynamic, non‑parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on‑the‑fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed‑form solution to the KL‑constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state‑of‑the‑art among training‑free methods. Crucially, JitRL outperforms the performance of computationally expensive fine‑tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.

Authors:Tianyi Gong, Can Han, Junxi Wu, Dahong Qian
Title: Fusion of Spatio-Temporal and Multi-Scale Frequency Features for Dry Electrodes MI-EEG Decoding
Abstract:
Dry‑electrode Motor Imagery Electroencephalography (MI‑EEG) enables fast, comfortable, real‑world Brain Computer Interface by eliminating gels and shortening setup for at‑home and wearable use.However, dry recordings pose three main issues: lower Signal‑to‑Noise Ratio with more baseline drift and sudden transients; weaker and noisier data with poor phase alignment across trials; and bigger variances between sessions. These drawbacks lead to larger data distribution shift, making features less stable for MI‑EEG tasks.To address these problems, we introduce STGMFM, a tri‑branch framework tailored for dry‑electrode MI‑EEG, which models complementary spatio‑temporal dependencies via dual graph orders, and captures robust envelope dynamics with a multi‑scale frequency mixing branch, motivated by the observation that amplitude envelopes are less sensitive to contact variability than instantaneous waveforms. Physiologically meaningful connectivity priors guide learning, and decision‑level fusion consolidates a noise‑tolerant consensus. On our collected dry‑electrode MI‑EEG, STGMFM consistently surpasses competitive CNN/Transformer/graph baselines. Codes are available at https://github.com/Tianyi‑325/STGMFM.

Authors:Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang
Title: Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning
Abstract:
Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi‑hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed‑source APIs, limiting flexibility and scalability. We propose Temp‑R1, the first autonomous end‑to‑end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single‑action reasoning, we expand the action space with specialized internal actions alongside external action. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B‑parameter Temp‑R1 achieves state‑of‑the‑art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. Our code will be publicly available soon at https://github.com/zjukg/Temp‑R1.

Authors:Longwei Ding, Anhao Zhao, Fanghua Ye, Ziyang Chen, Xiaoyu Shen
Title: From LLMs to LRMs: Rethinking Pruning for Reasoning-Centric Models
Abstract:
Large language models (LLMs) are increasingly costly to deploy, motivating extensive research on model pruning. However, most existing studies focus on instruction‑following LLMs, leaving it unclear whether established pruning strategies transfer to reasoning‑augmented models that explicitly generate long intermediate reasoning traces. In this work, we conduct a controlled study of pruning for both instruction‑following (LLM‑instruct) and reasoning‑augmented (LLM‑think) models. To isolate the effects of pruning, we align pruning calibration and post‑pruning recovery data with each model's original training distribution, which we show yields more stable and reliable pruning behavior. We evaluate static depth pruning, static width pruning, and dynamic pruning across 17 tasks spanning classification, generation, and reasoning. Our results reveal clear paradigm‑dependent differences: depth pruning outperforms width pruning on classification tasks, while width pruning is more robust for generation and reasoning. Moreover, static pruning better preserves reasoning performance, whereas dynamic pruning excels on classification and generation but remains challenging for long‑chain reasoning. These findings underscore the need for pruning strategies that explicitly account for the distinct characteristics of reasoning‑augmented LLMs. Our code is publicly available at https://github.com/EIT‑NLP/LRM‑Pruning.

Authors:Peixuan Han, Yingjie Yu, Jingjun Xu, Jiaxuan You
Title: DRPG (Decompose, Retrieve, Plan, Generate): An Agentic Framework for Academic Rebuttal
Abstract:
Despite the growing adoption of large language models (LLMs) in scientific research workflows, automated support for academic rebuttal, a crucial step in academic communication and peer review, remains largely underexplored. Existing approaches typically rely on off‑the‑shelf LLMs or simple pipelines, which struggle with long‑context understanding and often fail to produce targeted and persuasive responses. In this paper, we propose DRPG, an agentic framework for automatic academic rebuttal generation that operates through four steps: Decompose reviews into atomic concerns, Retrieve relevant evidence from the paper, Plan rebuttal strategies, and Generate responses accordingly. Notably, the Planner in DRPG reaches over 98% accuracy in identifying the most feasible rebuttal direction. Experiments on data from top‑tier conferences demonstrate that DRPG significantly outperforms existing rebuttal pipelines and achieves performance beyond the average human level using only an 8B model. Our analysis further demonstrates the effectiveness of the planner design and its value in providing multi‑perspective and explainable suggestions. We also showed that DRPG works well in a more complex multi‑round setting. These results highlight the effectiveness of DRPG and its potential to provide high‑quality rebuttal content and support the scaling of academic discussions. Codes for this work are available at https://github.com/ulab‑uiuc/DRPG‑RebuttalAgent.

Authors:Brijesh FNU, Viet Thanh Duy Nguyen, Ashima Sharma, Md Harun Rashid Molla, Chengyi Xu, Truong-Son Hy
Title: Multimodal Machine Learning for Soft High-k Elastomers under Data Scarcity
Abstract:
Dielectric materials are critical building blocks for modern electronics such as sensors, actuators, and transistors. With the rapid recent advance in soft and stretchable electronics for emerging human‑ and robot‑interfacing applications, there is a surging need for high‑performance dielectric elastomers. However, it remains a grand challenge to develop soft elastomers that simultaneously possess high dielectric constants (k, related to energy storage capacity) and low Young's moduli (E, related to mechanical flexibility). While some new elastomer designs have been reported in individual (mostly one‑off) studies, almost no structured dataset is currently available for dielectric elastomers that systematically encompasses their molecular sequence, dielectric, and mechanical properties. Within this context, we curate a compact, high‑quality dataset of acrylate‑based dielectric elastomers, one of the most widely explored elastomer backbones due to its versatile chemistry and molecular design flexibility, by screening and aggregating experimental results from the literature over the past 10 years. Building on this dataset, we propose a multimodal learning framework that leverages large‑scale pretrained polymer representations from graph‑ and sequence‑based encoders. These pretrained embeddings transfer rich chemical and structural knowledge from vast polymer corpora, enabling accurate few‑shot prediction of both dielectric and mechanical properties from molecular sequences. Our results represent a new paradigm for transferring knowledge from pretrained multimodal models to overcome severe data scarcity, which can be readily translated to other polymer backbones (e.g., silicones, urethanes) and thus accelerate data‑efficient discovery of soft high‑k dielectric elastomers. Our source code and dataset are publicly available at https://github.com/HySonLab/Polymers

Authors:Zhongyu Xiao, Zhiwei Hao, Jianyuan Guo, Yong Luo, Jia Liu, Jie Xu, Han Hu
Title: Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding
Abstract:
Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block‑wise diffusion process. Specifically, they suffer from spatial redundancy by modeling informative‑sparse suffix regions uniformly and temporal inefficiency by applying fixed denoising schedules across all the decoding process. To address this, we propose Streaming‑dLLM, a training‑free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence aware strategy with an early exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming‑dLLM achieves up to 68.2X speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at https://github.com/xiaoshideta/Streaming‑dLLM.

Authors:Yixin Liu, Kehan Yan, Shiyuan Li, Qingfeng Chen, Shirui Pan
Title: Beyond a Single Perspective: Text Anomaly Detection with Multi-View Language Representations
Abstract:
Text anomaly detection (TAD) plays a critical role in various language‑driven real‑world applications, including harmful content moderation, phishing detection, and spam review filtering. While two‑step "embedding‑detector" TAD methods have shown state‑of‑the‑art performance, their effectiveness is often limited by the use of a single embedding model and the lack of adaptability across diverse datasets and anomaly types. To address these limitations, we propose to exploit the embeddings from multiple pretrained language models and integrate them into MCA^2, a multi‑view TAD framework. MCA^2 adopts a multi‑view reconstruction model to effectively extract normal textual patterns from multiple embedding perspectives. To exploit inter‑view complementarity, a contrastive collaboration module is designed to leverage and strengthen the interactions across different views. Moreover, an adaptive allocation module is developed to automatically assign the contribution weight of each view, thereby improving the adaptability to diverse datasets. Extensive experiments on 10 benchmark datasets verify the effectiveness of MCA^2 against strong baselines. The source code of MCA^2 is available at https://github.com/yankehan/MCA2.

Authors:Raja Gond, Aditya K Kamath, Arkaprava Basu, Ramachandran Ramjee, Ashish Panwar
Title: LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
Abstract:
In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non‑determinism arises from floating‑point non‑associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non‑determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch‑invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism. Inspired by ideas from speculative decoding, we present LLM‑42, a scheduling‑based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape‑consistent reductions. Leveraging these insights, LLM‑42 decodes tokens using a non‑deterministic fast path and enforces determinism via a lightweight verify‑rollback loop. The verifier replays candidate tokens under a fixed‑shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. LLM‑42 mostly re‑uses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.

Authors:Rahul Bera, Zhenrong Lang, Caroline Hengartner, Konstantinos Kanellopoulos, Rakesh Kumar, Mohammad Sadrosadati, Onur Mutlu
Title: Athena: Synergizing Data Prefetching and Off-Chip Prediction via Online Reinforcement Learning
Abstract:
Prefetching and off‑chip prediction are two techniques proposed to hide long memory access latencies in high‑performance processors. In this work, we demonstrate that: (1) prefetching and off‑chip prediction often provide complementary performance benefits, yet (2) naively combining them often fails to realize their full performance potential, and (3) existing prefetcher control policies leave significant room for performance improvement behind. Our goal is to design a holistic framework that can autonomously learn to coordinate an off‑chip predictor with multiple prefetchers employed at various cache levels. To this end, we propose a new technique called Athena, which models the coordination between prefetchers and off‑chip predictor (OCP) as a reinforcement learning (RL) problem. Athena acts as the RL agent that observes multiple system‑level features (e.g., prefetcher/OCP accuracy, bandwidth usage) over an epoch of program execution, and uses them as state information to select a coordination action (i.e., enabling the prefetcher and/or OCP, and adjusting prefetcher aggressiveness). At the end of every epoch, Athena receives a numerical reward that measures the change in multiple system‑level metrics (e.g., number of cycles taken to execute an epoch). Athena uses this reward to autonomously and continuously learn a policy to coordinate prefetchers with OCP. Our extensive evaluation using a diverse set of memory‑intensive workloads shows that Athena consistently outperforms prior state‑of‑the‑art coordination policies across a wide range of system configurations with various combinations of underlying prefetchers, OCPs, and main memory bandwidths, while incurring only modest storage overhead. Athena is freely available at https://github.com/CMU‑SAFARI/Athena.

Authors:Sebastian Doerrich, Francesco Di Salvo, Jonas Alle, Christian Ledig
Title: Stylizing ViT: Anatomy-Preserving Instance Style Transfer for Domain Generalization
Abstract:
Deep learning models in medical image analysis often struggle with generalizability across domains and demographic groups due to data heterogeneity and scarcity. Traditional augmentation improves robustness, but fails under substantial domain shifts. Recent advances in stylistic augmentation enhance domain generalization by varying image styles but fall short in terms of style diversity or by introducing artifacts into the generated images. To address these limitations, we propose Stylizing ViT, a novel Vision Transformer encoder that utilizes weight‑shared attention blocks for both self‑ and cross‑attention. This design allows the same attention block to maintain anatomical consistency through self‑attention while performing style transfer via cross‑attention. We assess the effectiveness of our method for domain generalization by employing it for data augmentation on three distinct image classification tasks in the context of histopathology and dermatology. Results demonstrate an improved robustness (up to +13% accuracy) over the state of the art while generating perceptually convincing images without artifacts. Additionally, we show that Stylizing ViT is effective beyond training, achieving a 17% performance improvement during inference when used for test‑time augmentation. The source code is available at https://github.com/sdoerrich97/stylizing‑vit .

Authors:Aadam, Monu Verma, Mohamed Abdel-Mottaleb
Title: JaxARC: A High-Performance JAX-based Environment for Abstraction and Reasoning Research
Abstract:
The Abstraction and Reasoning Corpus (ARC) tests AI systems' ability to perform human‑like inductive reasoning from a few demonstration pairs. Existing Gymnasium‑based RL environments severely limit experimental scale due to computational bottlenecks. We present JaxARC, an open‑source, high‑performance RL environment for ARC implemented in JAX. Its functional, stateless architecture enables massive parallelism, achieving 38‑5,439x speedup over Gymnasium at matched batch sizes, with peak throughput of 790M steps/second. JaxARC supports multiple ARC datasets, flexible action spaces, composable wrappers, and configuration‑driven reproducibility, enabling large‑scale RL research previously computationally infeasible. JaxARC is available at https://github.com/aadimator/JaxARC.

Authors:Silong Chen, Yuchuan Luo, Guilin Deng, Yi Liu, Min Xu, Shaojing Fu, Xiaohua Jia
Title: Reconstructing Training Data from Adapter-based Federated Large Language Models
Abstract:
Adapter‑based Federated Large Language Models (FedLLMs) are widely adopted to reduce the computational, storage, and communication overhead of full‑parameter fine‑tuning for web‑scale applications while preserving user privacy. By freezing the backbone and training only compact low‑rank adapters, these methods appear to limit gradient leakage and thwart existing Gradient Inversion Attacks (GIAs). Contrary to this assumption, we show that low‑rank adapters create new, exploitable leakage channels. We propose the Unordered‑word‑bag‑based Text Reconstruction (UTR) attack, a novel GIA tailored to the unique structure of adapter‑based FedLLMs. UTR overcomes three core challenges: low‑dimensional gradients, frozen backbones, and combinatorially large reconstruction spaces by: (i) inferring token presence from attention patterns in frozen layers, (ii) performing sentence‑level inversion within the low‑rank subspace of adapter gradients, and (iii) enforcing semantic coherence through constrained greedy decoding guided by language priors. Extensive experiments across diverse models (GPT2‑Large, BERT, Qwen2.5‑7B) and datasets (CoLA, SST‑2, Rotten Tomatoes) demonstrate that UTR achieves near‑perfect reconstruction accuracy (ROUGE‑1/2 > 99), even with large batch size settings where prior GIAs fail completely. Our results reveal a fundamental tension between parameter efficiency and privacy in FedLLMs, challenging the prevailing belief that lightweight adaptation inherently enhances security. Our code and data are available at https://github.com/shwksnshwowk‑wq/GIA.

Authors:Chia-Ming Lee, Yu-Fan Lin, Jin-Hui Jiang, Yu-Jou Hsiao, Chih-Chung Hsu, Yu-Lun Liu
Title: ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
Abstract:
Single Image Reflection Separation (SIRS) disentangles mixed images into transmission and reflection layers. Existing methods suffer from transmission‑reflection confusion under nonlinear mixing, particularly in deep decoder layers, due to implicit fusion mechanisms and inadequate multi‑scale coordination. We propose ReflexSplit, a dual‑stream framework with three key innovations. (1) Cross‑scale Gated Fusion (CrGF) adaptively aggregates semantic priors, texture details, and decoder context across hierarchical depths, stabilizing gradient flow and maintaining feature consistency. (2) Layer Fusion‑Separation Blocks (LFSB) alternate between fusion for shared structure extraction and differential separation for layer‑specific disentanglement. Inspired by Differential Transformer, we extend attention cancellation to dual‑stream separation via cross‑stream subtraction. (3) Curriculum training progressively strengthens differential separation through depth‑dependent initialization and epoch‑wise warmup. Extensive experiments on synthetic and real‑world benchmarks demonstrate state‑of‑the‑art performance with superior perceptual quality and robust generalization. Our code is available at https://github.com/wuw2135/ReflexSplit.

Authors:Jialei Li, Yang Zhang, Yimeng Bai, Shuai Zhu, Ziqi Xue, Xiaoyan Zhao, Dingxian Wang, Frank Yang, Andrew Rabinovich, Xiangnan He
Title: UniGRec: Unified Generative Recommendation with Soft Identifiers for End-to-End Optimization
Abstract:
Generative recommendation has recently emerged as a transformative paradigm that directly generates target items, surpassing traditional cascaded approaches. It typically involves two components: a tokenizer that learns item identifiers and a recommender trained on them. Existing methods often decouple tokenization from recommendation or rely on asynchronous alternating optimization, limiting full end‑to‑end alignment. To address this, we unify the tokenizer and recommender under the ultimate recommendation objective via differentiable soft item identifiers, enabling joint end‑to‑end training. However, this introduces three challenges: training‑inference discrepancy due to soft‑to‑hard mismatch, item identifier collapse from codeword usage imbalance, and collaborative signal deficiency due to an overemphasis on fine‑grained token‑level semantics. To tackle these challenges, we propose UniGRec, a unified generative recommendation framework that addresses them from three perspectives. UniGRec employs Annealed Inference Alignment during tokenization to smoothly bridge soft training and hard inference, a Codeword Uniformity Regularization to prevent identifier collapse and encourage codebook diversity, and a Dual Collaborative Distillation mechanism that distills collaborative priors from a lightweight teacher model to jointly guide both the tokenizer and the recommender. Extensive experiments on real‑world datasets demonstrate that UniGRec consistently outperforms state‑of‑the‑art baseline methods. Our codes are available at https://github.com/Jialei‑03/UniGRec.

Authors:Zichuan Yang, Yiming Xing
Title: Active Hypothesis Testing for Correlated Combinatorial Anomaly Detection
Abstract:
We study the problem of identifying an anomalous subset of streams under correlated noise, motivated by monitoring and security in cyber‑physical systems. This problem can be viewed as a form of combinatorial pure exploration, where each stream plays the role of an arm and measurements must be allocated sequentially under uncertainty. Existing combinatorial bandit and hypothesis testing methods typically assume independent observations and fail to exploit correlation for efficient measurement design. We propose ECC‑AHT, an adaptive algorithm that selects continuous, constrained measurements to maximize Chernoff information between competing hypotheses, enabling active noise cancellation through differential sensing. ECC‑AHT achieves optimal sample complexity guarantees and significantly outperforms state‑of‑the‑art baselines in both synthetic and real‑world correlated environments. The code is available on https://github.com/VincentdeCristo/ECC‑AHT

Authors:Ruoqing Zheng, Chang Sun, Qibin Liu, Lauri Laatu, Arianna Cox, Benedikt Maier, Alexander Tapper, Jose G. F. Coutinho, Wayne Luk, Zhiqiang Que
Title: JetFormer: A Scalable and Efficient Transformer for Jet Tagging from Offline Analysis to FPGA Triggers
Abstract:
We present JetFormer, a versatile and scalable encoder‑only Transformer architecture for particle jet tagging at the Large Hadron Collider (LHC). Unlike prior approaches that are often tailored to specific deployment regimes, JetFormer is designed to operate effectively across the full spectrum of jet tagging scenarios, from high‑accuracy offline analysis to ultra‑low‑latency online triggering. The model processes variable‑length sets of particle features without relying on input of explicit pairwise interactions, yet achieves competitive or superior performance compared to state‑of‑the‑art methods. On the large‑scale JetClass dataset, a large‑scale JetFormer matches the accuracy of the interaction‑rich ParT model (within 0.7%) while using 37.4% fewer FLOPs, demonstrating its computational efficiency and strong generalization. On benchmark HLS4ML 150P datasets, JetFormer consistently outperforms existing models such as MLPs, Deep Sets, and Interaction Networks by 3‑4% in accuracy. To bridge the gap to hardware deployment, we further introduce a hardware‑aware optimization pipeline based on multi‑objective hyperparameter search, yielding compact variants like JetFormer‑tiny suitable for FPGA‑based trigger systems with sub‑microsecond latency requirements. Through structured pruning and quantization, we show that JetFormer can be aggressively compressed with minimal accuracy loss. By unifying high‑performance modeling and deployability within a single architectural framework, JetFormer provides a practical pathway for deploying Transformer‑based jet taggers in both offline and online environments at the LHC. Code is available at https://github.com/walkieq/JetFormer.

Authors:Yinkai Wang, Yan Zhou Chen, Xiaohui Chen, Li-Ping Liu, Soha Hassoun
Title: SpecBridge: Bridging Mass Spectrometry and Molecular Representations via Cross-Modal Alignment
Abstract:
Small‑molecule identification from tandem mass spectrometry (MS/MS) remains a bottleneck in untargeted settings where spectral libraries are incomplete. While deep learning offers a solution, current approaches typically fall into two extremes: explicit generative models that construct molecular graphs atom‑by‑atom, or joint contrastive models that learn cross‑modal subspaces from scratch. We introduce SpecBridge, a novel implicit alignment framework that treats structure identification as a geometric alignment problem. SpecBridge fine‑tunes a self‑supervised spectral encoder (DreaMS) to project directly into the latent space of a frozen molecular foundation model (ChemBERTa), and then performs retrieval by cosine similarity to a fixed bank of precomputed molecular embeddings. Across MassSpecGym, Spectraverse, and MSnLib benchmarks, SpecBridge improves top‑1 retrieval accuracy by roughly 20‑25% relative to strong neural baselines, while keeping the number of trainable parameters small. These results suggest that aligning to frozen foundation models is a practical, stable alternative to designing new architectures from scratch. The code for SpecBridge is released at https://github.com/HassounLab/SpecBridge.

Authors:Seyyed Saeid Cheshmi, Hahnemann Ortiz, James Mooney, Dongyeop Kang
Title: Reasoning Beyond Literal: Cross-style Multimodal Reasoning for Figurative Language Understanding
Abstract:
Vision‑language models (VLMs) have demonstrated strong reasoning abilities in literal multimodal tasks such as visual mathematics and science question answering. However, figurative language, such as sarcasm, humor, and metaphor, remains a significant challenge, as it conveys intent and emotion through subtle incongruities between expressed and intended meanings. In multimodal settings, accompanying images can amplify or invert textual meaning, demanding models that reason across modalities and account for subjectivity. We propose a three‑step framework for developing efficient multimodal reasoning models that can (i) interpret multimodal figurative language, (ii) provide transparent reasoning traces, and (iii) generalize across multiple figurative styles. Experiments across four styles show that (1) incorporating reasoning traces substantially improves multimodal figurative understanding, (2) reasoning learned in one style can transfer to others, especially between related styles like sarcasm and humor, and (3) training jointly across styles yields a generalized reasoning VLM that outperforms much larger open‑ and closed‑source models. Our findings show that lightweight VLMs with verifiable reasoning achieve robust cross‑style generalization while providing inspectable reasoning traces for multimodal tasks. The code and implementation are available at https://github.com/scheshmi/CrossStyle‑MMR.

Authors:Adrian Tkachenko, Sepehr Salem, Ayotomiwa Ezekiel Adeniyi, Zulal Bingol, Mohammed Nayeem Uddin, Akshat Prasanna, Alexander Zelikovsky, Serghei Mangul, Can Alkan, Mohammed Alser
Title: FASTR: Reimagining FASTQ via Compact Image-inspired Representation
Abstract:
Motivation: High‑throughput sequencing (HTS) enables population‑scale genomics but generates massive datasets, creating bottlenecks in storage, transfer, and analysis. FASTQ, the standard format for over two decades, stores one byte per base and one byte per quality score, leading to inefficient I/O, high storage costs, and redundancy. Existing compression tools can mitigate some issues, but often introduce costly decompression or complex dependency issues. Results: We introduce FASTR, a lossless, computation‑native successor to FASTQ that encodes each nucleotide together with its base quality score into a single 8‑bit value. FASTR reduces file size by at least 2x while remaining fully reversible and directly usable for downstream analyses. Applying general‑purpose compression tools on FASTR consistently yields higher compression ratios, 2.47, 3.64, and 4.8x faster compression, and 2.34, 1.96, 1.75x faster decompression than on FASTQ across Illumina, HiFi, and ONT reads. FASTR is machine‑learning‑ready, allowing reads to be consumed directly as numerical vectors or image‑like representations. We provide a highly parallel software ecosystem for FASTQ‑FASTR conversion and show that FASTR integrates with existing tools, such as minimap2, with minimal interface changes and no performance overhead. By eliminating decompression costs and reducing data movement, FASTR lays the foundation for scalable genomics analyses and real‑time sequencing workflows. Availability and Implementation: https://github.com/ALSER‑Lab/FASTR

Authors:Yonghan Jung, Bogyeong Kang
Title: Data-Driven Information-Theoretic Causal Bounds under Unmeasured Confounding
Abstract:
We develop a data‑driven information‑theoretic framework for sharp partial identification of causal effects under unmeasured confounding. Existing approaches often rely on restrictive assumptions, such as bounded or discrete outcomes; require external inputs (for example, instrumental variables, proxies, or user‑specified sensitivity parameters); necessitate full structural causal model specifications; or focus solely on population‑level averages while neglecting covariate‑conditional treatment effects. We overcome all four limitations simultaneously by establishing novel information‑theoretic, data‑driven divergence bounds. Our key theoretical contribution shows that the f‑divergence between the observational distribution P(Y | A = a, X = x) and the interventional distribution P(Y | do(A = a), X = x) is upper bounded by a function of the propensity score alone. This result enables sharp partial identification of conditional causal effects directly from observational data, without requiring external sensitivity parameters, auxiliary variables, full structural specifications, or outcome boundedness assumptions. For practical implementation, we develop a semiparametric estimator satisfying Neyman orthogonality (Chernozhukov et al., 2018), which ensures square‑root‑n consistent inference even when nuisance functions are estimated using flexible machine learning methods. Simulation studies and real‑world data applications, implemented in the GitHub repository (https://github.com/yonghanjung/Information‑Theretic‑Bounds), demonstrate that our framework provides tight and valid causal bounds across a wide range of data‑generating processes.

Authors:Inderjeet Singh, Eleonore Vissol-Gaudin, Andikan Otung, Motoyoshi Sekiya
Title: Learning to Collaborate: An Orchestrated-Decentralized Framework for Peer-to-Peer LLM Federation
Abstract:
Fine‑tuning Large Language Models (LLMs) for specialized domains is constrained by a fundamental challenge: the need for diverse, cross‑organizational data conflicts with the principles of data privacy and sovereignty. While Federated Learning (FL) provides a framework for collaboration without raw data exchange, its classic centralized form introduces a single point of failure and remains vulnerable to model inversion attacks. Decentralized FL (DFL) mitigates this risk by removing the central aggregator but typically relies on inefficient, random peer‑to‑peer (P2P) pairings, forming a collaboration graph that is blind to agent heterogeneity and risks negative transfer. This paper introduces KNEXA‑FL, a novel framework for orchestrated decentralization that resolves this trade‑off. KNEXA‑FL employs a non‑aggregating Central Profiler/Matchmaker (CPM) that formulates P2P collaboration as a contextual bandit problem, using a LinUCB algorithm on abstract agent profiles to learn an optimal matchmaking policy. It orchestrates direct knowledge exchange between heterogeneous, PEFT‑based LLM agents via secure distillation, without ever accessing the models themselves. Our comprehensive experiments on a challenging code generation task show that KNEXA‑FL yields substantial gains, improving Pass@1 by approx. 50% relative to random P2P collaboration. Critically, our orchestrated approach demonstrates stable convergence, in stark contrast to a powerful centralized distillation baseline which suffers from catastrophic performance collapse. Our work establishes adaptive, learning‑based orchestration as a foundational principle for building robust and effective decentralized AI ecosystems.

Authors:Ole Stüven, Keno Moenck, Thorsten Schüppstuhl
Title: CUROCKET: Optimizing ROCKET for GPU
Abstract:
ROCKET (RandOm Convolutional KErnel Transform) is a feature extraction algorithm created for Time Series Classification (TSC), published in 2019. It applies convolution with randomly generated kernels on a time series, producing features that can be used to train a linear classifier or regressor like Ridge. At the time of publication, ROCKET was on par with the best state‑of‑the‑art algorithms for TSC in terms of accuracy while being significantly less computationally expensive, making ROCKET a compelling algorithm for TSC. This also led to several subsequent versions, further improving accuracy and computational efficiency. The currently available ROCKET implementations are mostly bound to execution on CPU. However, convolution is a task that can be highly parallelized and is therefore suited to be executed on GPU, which speeds up the computation significantly. A key difficulty arises from the inhomogeneous kernels ROCKET uses, making standard methods for applying convolution on GPU inefficient. In this work, we propose an algorithm that is able to efficiently perform ROCKET on GPU and achieves up to 11 times higher computational efficiency per watt than ROCKET on CPU. The code for CUROCKET is available in this repository https://github.com/oleeven/CUROCKET on github.

Authors:Haoxuan Li, He Chang, Yunshan Ma, Yi Bin, Yang Yang, See-Kiong Ng, Tat-Seng Chua
Title: ThinkTank-ME: A Multi-Expert Framework for Middle East Event Forecasting
Abstract:
Event forecasting is inherently influenced by multifaceted considerations, including international relations, regional historical dynamics, and cultural contexts. However, existing LLM‑based approaches employ single‑model architectures that generate predictions along a singular explicit trajectory, constraining their ability to capture diverse geopolitical nuances across complex regional contexts. To address this limitation, we introduce ThinkTank‑ME, a novel Think Tank framework for Middle East event forecasting that emulates collaborative expert analysis in real‑world strategic decision‑making. To facilitate expert specialization and rigorous evaluation, we construct POLECAT‑FOR‑ME, a Middle East‑focused event forecasting benchmark. Experimental results demonstrate the superiority of multi‑expert collaboration in handling complex temporal geopolitical forecasting tasks. The code is available at https://github.com/LuminosityX/ThinkTank‑ME.

Authors:Wei Zhou, Jun Zhou, Haoyu Wang, Zhenghao Li, Qikang He, Shaokun Han, Guoliang Li, Xuanhe Zhou, Yeye He, Chunwei Liu, Zirui Tang, Bin Wang, Shen Tang, Kai Zuo, Yuyu Luo, Zhenzhe Zheng, Conghui He, Jingren Zhou, Fan Wu
Title: Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
Abstract:
Data preparation aims to denoise raw datasets, uncover cross‑dataset relationships, and extract valuable insights from them, which is essential for a wide range of data‑centric applications. Driven by (i) rising demands for application‑ready data (e.g., for analytics, visualization, decision‑making), (ii) increasingly powerful LLM techniques, and (iii) the emergence of infrastructures that facilitate flexible agent construction (e.g., using Databricks Unity Catalog), LLM‑enhanced methods are rapidly becoming a transformative and potentially dominant paradigm for data preparation. By investigating hundreds of recent literature works, this paper presents a systematic review of this evolving landscape, focusing on the use of LLM techniques to prepare data for diverse downstream tasks. First, we characterize the fundamental paradigm shift, from rule‑based, model‑specific pipelines to prompt‑driven, context‑aware, and agentic preparation workflows. Next, we introduce a task‑centric taxonomy that organizes the field into three major tasks: data cleaning (e.g., standardization, error processing, imputation), data integration (e.g., entity matching, schema matching), and data enrichment (e.g., data annotation, profiling). For each task, we survey representative techniques, and highlight their respective strengths (e.g., improved generalization, semantic understanding) and limitations (e.g., the prohibitive cost of scaling LLMs, persistent hallucinations even in advanced agents, the mismatch between advanced methods and weak evaluation). Moreover, we analyze commonly used datasets and evaluation metrics (the empirical part). Finally, we discuss open research challenges and outline a forward‑looking roadmap that emphasizes scalable LLM‑data systems, principled designs for reliable agentic workflows, and robust evaluation protocols.

Authors:Basile Van Hoorick, Dian Chen, Shun Iwase, Pavel Tokmakov, Muhammad Zubair Irshad, Igor Vasiljevic, Swati Gupta, Fangzhou Cheng, Sergey Zakharov, Vitor Campagnolo Guizilini
Title: AnyView: Synthesizing Any Novel View in Dynamic Scenes
Abstract:
Modern generative video models excel at producing convincing, high‑quality outputs, but struggle to maintain multi‑view and spatiotemporal consistency in highly dynamic real‑world environments. In this work, we introduce AnyView, a diffusion‑based video generation framework for \emphdynamic view synthesis with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi‑view static (3D) and multi‑view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero‑shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose AnyViewBench, a challenging new benchmark tailored towards \emphextreme dynamic view synthesis in diverse real‑world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from \emphany viewpoint. Results, data, code, and models can be viewed at: https://tri‑ml.github.io/AnyView/

Authors:Lin Huang, Chengxiang Huang, Ziang Wang, Yiyue Du, Chu Wang, Haocheng Lu, Yunyang Li, Xiaoli Liu, Arthur Jiang, Jia Zhang
Title: E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory
Abstract:
Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of geometric features or dense tensor products on every edge. To overcome this, we introduce E2Former‑V2, a scalable architecture that integrates algebraic sparsity with hardware‑aware execution. We first propose Equivariant Axis‑Aligned Sparsification (EAAS). EAAS builds on Wigner‑6j convolution by exploiting an \mathrmSO(3) \rightarrow \mathrmSO(2) change of basis to transform computationally expensive dense tensor contractions into efficient, sparse parity re‑indexing operations. Building on this representation, we introduce On‑the‑Fly Equivariant Attention, a fully node‑centric mechanism implemented via a custom fused Triton kernel. By eliminating materialized edge tensors and maximizing SRAM utilization, our kernel achieves a 20× improvement in TFLOPS compared to standard implementations. Extensive experiments on the SPICE and OMol25 datasets demonstrate that E2Former‑V2 maintains comparable predictive performance while notably accelerating inference. This work demonstrates that large equivariant transformers can be trained efficiently using widely accessible GPU platforms. The code is avalible at https://github.com/IQuestLab/UBio‑MolFM/tree/e2formerv2.

Authors:Erik Wallin, Fredrik Kahl, Lars Hammarstrand
Title: Semi-Supervised Hierarchical Open-Set Classification
Abstract:
Hierarchical open‑set classification handles previously unseen classes by assigning them to the most appropriate high‑level category in a class taxonomy. We extend this paradigm to the semi‑supervised setting, enabling the use of large‑scale, uncurated datasets containing a mixture of known and unknown classes to improve the hierarchical open‑set performance. To this end, we propose a teacher‑student framework based on pseudo‑labeling. Two key components are introduced: 1) subtree pseudo‑labels, which provide reliable supervision in the presence of unknown data, and 2) age‑gating, a mechanism that mitigates overconfidence in pseudo‑labels. Experiments show that our framework outperforms self‑supervised pretraining followed by supervised adaptation, and even matches the fully supervised counterpart when using only 20 labeled samples per class on the iNaturalist19 benchmark. Our code is available at https://github.com/walline/semihoc.

Authors:Feixiang Zheng, Yu Wu, Cecilia Mascolo, Ting Dang
Title: Rethinking Large Language Models For Irregular Time Series Classification In Critical Care
Abstract:
Time series data from the Intensive Care Unit (ICU) provides critical information for patient monitoring. While recent advancements in applying Large Language Models (LLMs) to time series modeling (TSM) have shown great promise, their effectiveness on the irregular ICU data, characterized by particularly high rates of missing values, remains largely unexplored. This work investigates two key components underlying the success of LLMs for TSM: the time series encoder and the multimodal alignment strategy. To this end, we establish a systematic testbed to evaluate their impact across various state‑of‑the‑art LLM‑based methods on benchmark ICU datasets against strong supervised and self‑supervised baselines. Results reveal that the encoder design is more critical than the alignment strategy. Encoders that explicitly model irregularity achieve substantial performance gains, yielding an average AUPRC increase of 12.8% over the vanilla Transformer. While less impactful, the alignment strategy is also noteworthy, with the best‑performing semantically rich, fusion‑based strategy achieving a modest 2.9% improvement over cross‑attention. However, LLM‑based methods require at least 10× longer training than the best‑performing irregular supervised models, while delivering only comparable performance. They also underperform in data‑scarce few‑shot learning settings. These findings highlight both the promise and current limitations of LLMs for irregular ICU time series. The code is available at https://github.com/mHealthUnimelb/LLMTS.

Authors:Hannah Cyberey, Yangfeng Ji, David Evans
Title: White-Box Sensitivity Auditing with Steering Vectors
Abstract:
Algorithmic audits are essential tools for examining systems for properties required by regulators or desired by operators. Current audits of large language models (LLMs) primarily rely on black‑box evaluations that assess model behavior only through input‑output testing. These methods are limited to tests constructed in the input space, often generated by heuristics. In addition, many socially relevant model properties (e.g., gender bias) are abstract and difficult to measure through text‑based inputs alone. To address these limitations, we propose a white‑box sensitivity auditing framework for LLMs that leverages activation steering to conduct more rigorous assessments through model internals. Our auditing method conducts internal sensitivity tests by manipulating key concepts relevant to the model's intended function for the task. We demonstrate its application to bias audits in four simulated high‑stakes LLM decision tasks. Our method consistently reveals substantial dependence on protected attributes in model predictions, even in settings where standard black‑box evaluations suggest little or no bias. Our code is openly available at https://github.com/hannahxchen/llm‑steering‑audit

Authors:Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, Hyeonho Jeong
Title: Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory
Abstract:
Recent foundational video‑to‑video diffusion models have achieved impressive results in editing user provided videos by modifying appearance, motion, or camera movement. However, real‑world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi‑turn setting, current video editors struggle to maintain cross‑consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross‑consistency in multi‑turn video editing and introduce Memory‑V2V, a simple, yet effective framework that augments existing video‑to‑video models with explicit memory. Given an external cache of previously edited videos, Memory‑V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory‑V2V on challenging tasks including video novel view synthesis and text‑conditioned long video editing. Extensive experiments show that Memory‑V2V produces videos that are significantly more cross‑consistent with minimal computational overhead, while maintaining or even improving task‑specific performance over state‑of‑the‑art baselines. Project page: https://dohunlee1.github.io/MemoryV2V

Authors:Bing Xu, Terry Chen, Fengzhe Zhou, Tianqi Chen, Yangqing Jia, Vinod Grover, Haicheng Wu, Wei Liu, Craig Wittenbrink, Wen-mei Hwu, Roger Bringmann, Ming-Yu Liu, Luis Ceze, Michael Lightstone, Humphrey Shi
Title: VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
Abstract:
VIBETENSOR is an open‑source research system software stack for deep learning, generated by LLM‑powered coding agents under high‑level human guidance. In this paper, "fully generated" refers to code provenance: implementation changes were produced and applied as agent‑proposed diffs; validation relied on agent‑run builds, tests, and differential checks, without per‑change manual diff review. It implements a PyTorch‑style eager tensor library with a C++20 core (CPU+CUDA), a torch‑like Python overlay via nanobind, and an experimental Node.js/TypeScript interface. Unlike thin bindings, VIBETENSOR includes its own tensor/storage system, schema‑lite dispatcher, reverse‑mode autograd, CUDA runtime (streams/events/graphs), a stream‑ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. We view this release as a milestone for AI‑assisted software engineering: it shows coding agents can generate a coherent deep learning runtime spanning language bindings down to CUDA memory management, validated primarily by builds and tests. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact. We report repository scale and test‑suite composition, and summarize reproducible microbenchmarks from an accompanying AI‑generated kernel suite, including fused attention versus PyTorch SDPA/FlashAttention. We also report end‑to‑end training sanity checks on 3 small workloads (sequence reversal, ViT, miniGPT) on NVIDIA H100 (Hopper, SM90) and Blackwell‑class GPUs; multi‑GPU results are Blackwell‑only and use an optional CUTLASS‑based ring‑allreduce plugin gated on CUDA 13+ and sm103a toolchain support. Finally, we discuss failure modes in generated system software, including a "Frankenstein" composition effect where locally correct subsystems interact to yield globally suboptimal performance.

Authors:Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
Title: Learning to Discover at Test Time
Abstract:
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test‑time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test‑Time Training to Discover (TTT‑Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT‑Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single‑cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt‑oss‑120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test‑time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

Authors:Dong Xu, Jiantao Wu, Qihua Pan, Sisi Yuan, Zexuan Zhu, Junkai Ji
Title: Rethinking Drug-Drug Interaction Modeling as Generalizable Relation Learning
Abstract:
Drug‑drug interaction (DDI) prediction is central to drug discovery and clinical development, particularly in the context of increasingly prevalent polypharmacy. Although existing computational methods achieve strong performance on standard benchmarks, they often fail to generalize to realistic deployment scenarios, where most candidate drug pairs involve previously unseen drugs and validated interactions are scarce. We demonstrate that proximity in the embedding spaces of prevailing molecule‑centric DDI models does not reliably correspond to interaction labels, and that simply scaling up model capacity therefore fails to improve generalization. To address these limitations, we propose GenRel‑DDI, a generalizable relation learning framework that reformulates DDI prediction as a relation‑centric learning problem, in which interaction representations are learned independently of drug identities. This relation‑level abstraction enables the capture of transferable interaction patterns that generalize to unseen drugs and novel drug pairs. Extensive experiments across multiple benchmark demonstrate that GenRel‑DDI consistently and significantly outperforms state‑of‑the‑art methods, with particularly large gains on strict entity‑disjoint evaluations, highlighting the effectiveness and practical utility of relation learning for robust DDI prediction. The code is available at https://github.com/SZU‑ADDG/GenRel‑DDI.

Authors:Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, Wentao Zhang, Chunlei Men, Guang Liu, Yonghua Lin
Title: Towards Automated Kernel Generation in the Era of LLMs
Abstract:
The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high‑level algorithmic semantics into low‑level hardware operations. Achieving near‑optimal kernels requires expert‑level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time‑consuming and non‑scalable process. Recent advances in large language models (LLMs) and LLM‑based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well‑suited to compress expert‑level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback‑driven loop. Rapid progress has been made in this area. However, the field remains fragmented, lacking a systematic perspective for LLM‑driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM‑based approaches and agentic optimization workflows, and systematically compiling the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open‑source GitHub repository at https://github.com/flagos‑ai/awesome‑LLM‑driven‑kernel‑generation.

Authors:Jingjing Bai, Yoshinobu Kawahara
Title: Dualformer: Time-Frequency Dual Domain Learning for Long-term Time Series Forecasting
Abstract:
Transformer‑based models, despite their promise for long‑term time series forecasting (LTSF), suffer from an inherent low‑pass filtering effect that limits their effectiveness. This issue arises due to undifferentiated propagation of frequency components across layers, causing a progressive attenuation of high‑frequency information crucial for capturing fine‑grained temporal variations. To address this limitation, we propose Dualformer, a principled dual‑domain framework that rethinks frequency modeling from a layer‑wise perspective. Dualformer introduces three key components: (1) a dual‑branch architecture that concurrently models complementary temporal patterns in both time and frequency domains; (2) a hierarchical frequency sampling module that allocates distinct frequency bands to different layers, preserving high‑frequency details in lower layers while modeling low‑frequency trends in deeper layers; and (3) a periodicity‑aware weighting mechanism that dynamically balances contributions from the dual branches based on the harmonic energy ratio of inputs, supported theoretically by a derived lower bound. This design enables structured frequency modeling and adaptive integration of time‑frequency features, effectively preserving high‑frequency information and enhancing generalization. Extensive experiments conducted on eight widely used benchmarks demonstrate Dualformer's robustness and superior performance, particularly on heterogeneous or weakly periodic data. Our code is publicly available at https://github.com/Akira‑221/Dualformer.

Authors:Huayu Li, ZhengXiao He, Siyuan Tian, Jinghao Wen, Ao Li
Title: Martingale Foresight Sampling: A Principled Approach to Inference-Time LLM Decoding
Abstract:
Standard autoregressive decoding in large language models (LLMs) is inherently short‑sighted, often failing to find globally optimal reasoning paths due to its token‑by‑token generation process. While inference‑time strategies like foresight sampling attempt to mitigate this by simulating future steps, they typically rely on ad‑hoc heuristics for valuing paths and pruning the search space. This paper introduces Martingale Foresight Sampling (MFS), a principled framework that reformulates LLM decoding as a problem of identifying an optimal stochastic process. By modeling the quality of a reasoning path as a stochastic process, we leverage Martingale theory to design a theoretically‑grounded algorithm. Our approach replaces heuristic mechanisms with principles from probability theory: step valuation is derived from the Doob Decomposition Theorem to measure a path's predictable advantage, path selection uses Optional Stopping Theory for principled pruning of suboptimal candidates, and an adaptive stopping rule based on the Martingale Convergence Theorem terminates exploration once a path's quality has provably converged. Experiments on six reasoning benchmarks demonstrate that MFS surpasses state‑of‑the‑art methods in accuracy while significantly improving computational efficiency. Code will be released at https://github.com/miraclehetech/EACL2026‑Martingale‑Foresight‑Sampling.

Authors:Md Nabi Newaz Khan, Abdullah Arafat Miah, Yu Bi
Title: Multi-Targeted Graph Backdoor Attack
Abstract:
Graph neural network (GNN) have demonstrated exceptional performance in solving critical problems across diverse domains yet remain susceptible to backdoor attacks. Existing studies on backdoor attack for graph classification are limited to single target attack using subgraph replacement based mechanism where the attacker implants only one trigger into the GNN model. In this paper, we introduce the first multi‑targeted backdoor attack for graph classification task, where multiple triggers simultaneously redirect predictions to different target labels. Instead of subgraph replacement, we propose subgraph injection which preserves the structure of the original graphs while poisoning the clean graphs. Extensive experiments demonstrate the efficacy of our approach, where our attack achieves high attack success rates for all target labels with minimal impact on the clean accuracy. Experimental results on five dataset demonstrate the superior performance of our attack framework compared to the conventional subgraph replacement‑based attack. Our analysis on four GNN models confirms the generalization capability of our attack which is effective regardless of the GNN model architectures and training parameters settings. We further investigate the impact of the attack design parameters including injection methods, number of connections, trigger sizes, trigger edge density and poisoning ratios. Additionally, our evaluation against state‑of‑the‑art defenses (randomized smoothing and fine‑pruning) demonstrates the robustness of our proposed multi‑target attacks. This work highlights the GNN vulnerability against multi‑targeted backdoor attack in graph classification task. Our source codes will be available at https://github.com/SiSL‑URI/Multi‑Targeted‑Graph‑Backdoor‑Attack.

Authors:Fahd Seddik, Abdulrahman Elbedewy, Gaser Sami, Mohamed Abdelmoniem, Yahia Zakaria
Title: Panther: Faster and Cheaper Computations with Randomized Numerical Linear Algebra
Abstract:
Training modern deep learning models is increasingly constrained by GPU memory and compute limits. While Randomized Numerical Linear Algebra (RandNLA) offers proven techniques to compress these models, the lack of a unified, production‑grade library prevents widely adopting these methods. We present Panther, a PyTorch‑compatible library that consolidates established RandNLA algorithms into a single high‑performance framework. Panther engineers efficient, drop‑in replacements for standard components including sketched linear layers, 2D convolution, multi‑head attention, and randomized matrix decompositions (such as pivoted CholeskyQR). By implementing a custom C++/CUDA backend (pawX), Panther provides an optimized implementation that can run on both CPUs and GPUs. We demonstrate the effectiveness of RandNLA techniques and Panther's ease of adoption. By replacing standard PyTorch linear layers with Panther layers (requiring only a few lines of code) we achieve significant memory savings (up to 75%) on BERT while maintaining comparable loss. Source code is available (MIT License) at https://github.com/FahdSeddik/panther, along with demonstration video at https://youtu.be/7M3RQb4KWxs.

Authors:Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem
Title: CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Abstract:
Medical vision‑language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error‑aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine‑tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy‑grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data‑efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma‑4b‑it‑cure

Authors:Francesca Pia Panaccione, Carlo Sgaravatti, Pietro Pinoli
Title: GeMM-GAN: A Multimodal Generative Model Conditioned on Histopathology Images and Clinical Descriptions for Gene Expression Profile Generation
Abstract:
Biomedical research increasingly relies on integrating diverse data modalities, including gene expression profiles, medical images, and clinical metadata. While medical images and clinical metadata are routinely collected in clinical practice, gene expression data presents unique challenges for widespread research use, mainly due to stringent privacy regulations and costly laboratory experiments. To address these limitations, we present GeMM‑GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles. GeMM‑GAN combines a Transformer Encoder for image patches with a final Cross Attention mechanism between patches and text tokens, producing a conditioning vector to guide a generative model in generating biologically coherent gene expression profiles. We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles, improving by more than 11% the accuracy on downstream disease type prediction compared to current state‑of‑the‑art generative models. Code will be available at: https://github.com/francescapia/GeMM‑GAN

Authors:Rishit Chugh
Title: RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models
Abstract:
The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy‑violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient‑based search. Although effective, these methods particularly GCG are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource‑efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre‑trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm‑related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red‑teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.

Authors:Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
Title: The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Abstract:
Diffusion Large Language Models (dLLMs) break the rigid left‑to‑right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter‑intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high‑uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl‑thu.github.io/the‑flexibility‑trap

Authors:Bostan Khan, Masoud Daneshtalab
Title: Predictor-Free and Hardware-Aware Federated Neural Architecture Search via Pareto-Guided Supernet Training
Abstract:
Federated Neural Architecture Search (FedNAS) aims to automate model design for privacy‑preserving Federated Learning (FL) but currently faces two critical bottlenecks: unguided supernet training that yields suboptimal models, and costly multi‑hour pipelines for post‑training subnet discovery. We introduce DeepFedNAS, a novel, two‑phase framework underpinned by a multi‑objective fitness function that synthesizes mathematical network design with architectural heuristics. Enabled by a re‑engineered supernet, DeepFedNAS introduces Federated Pareto Optimal Supernet Training, which leverages a pre‑computed Pareto‑optimal cache of high‑fitness architectures as an intelligent curriculum to optimize shared supernet weights. Subsequently, its Predictor‑Free Search Method eliminates the need for costly accuracy surrogates by utilizing this fitness function as a direct, zero‑cost proxy for accuracy, enabling on‑demand subnet discovery in mere seconds. DeepFedNAS achieves state‑of‑the‑art accuracy (e.g., up to 1.21% absolute improvement on CIFAR‑100), superior parameter and communication efficiency, and a substantial ~61x speedup in total post‑training search pipeline time. By reducing the pipeline from over 20 hours to approximately 20 minutes (including initial cache generation) and enabling 20‑second individual subnet searches, DeepFedNAS makes hardware‑aware FL deployments instantaneous and practical. The complete source code and experimental scripts are available at: https://github.com/bostankhan6/DeepFedNAS

Authors:Oleg Shchendrigin, Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov
Title: Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning
Abstract:
Effective decision‑making in the real world depends on memory that is both stable and adaptive: environments change over time, and agents must retain relevant information over long horizons while also updating or overwriting outdated content when circumstances shift. Existing Reinforcement Learning (RL) benchmarks and memory‑augmented agents focus primarily on retention, leaving the equally critical ability of memory rewriting largely unexplored. To address this gap, we introduce a benchmark that explicitly tests continual memory updating under partial observability, i.e. the natural setting where an agent must rely on memory rather than current observations, and use it to compare recurrent, transformer‑based, and structured memory architectures. Our experiments reveal that classic recurrent models, despite their simplicity, demonstrate greater flexibility and robustness in memory rewriting tasks than modern structured memories, which succeed only under narrow conditions, and transformer‑based agents, which often fail beyond trivial retention cases. These findings expose a fundamental limitation of current approaches and emphasize the necessity of memory mechanisms that balance stable retention with adaptive updating. Our work highlights this overlooked challenge, introduces benchmarks to evaluate it, and offers insights for designing future RL agents with explicit and trainable forgetting mechanisms. Code: https://quartz‑admirer.github.io/Memory‑Rewriting/

Authors:Adam Rokah, Daniel Veress, Caleb Caulk, Sourav Sharan
Title: Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization
Abstract:
Mixture‑of‑Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization, avoiding expert collapse. To analyze generalization, we compute Hessian‑based sharpness metrics at convergence, including the largest eigenvalue and trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance. Complementary loss surface perturbation analyses reveal qualitative differences in non‑local behavior under finite parameter perturbations between dense and MoE models, which help contextualize curvature‑based measurements without directly explaining validation accuracy. We further evaluate empirical inference efficiency and show that naively implemented conditional routing does not yield inference speedups on modern hardware at this scale, highlighting the gap between theoretical and realized efficiency in sparse MoE models.

Authors:Jannis Becktepe, Aleksandra Franz, Nils Thuerey, Sebastian Peitz
Title: Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control
Abstract:
Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical setups, and evaluation protocols. Current AFC benchmarks attempt to address these issues but heavily rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multi‑agent support. To overcome these limitations, we introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC. Built entirely in PyTorch on top of the GPU‑accelerated PICT solver, FluidGym runs in a single Python stack, requires no external CFD software, and provides standardized evaluation protocols. We present baseline results with PPO and SAC and release all environments, datasets, and trained models as public resources. FluidGym enables systematic comparison of control methods, establishes a scalable foundation for future research in learning‑based flow control, and is available at https://github.com/safe‑autonomous‑systems/fluidgym.

Authors:Michael Feil, Julius Lipp
Title: RadixMLP -- Intra-batch Deduplication for Causal Transformers
Abstract:
Batch inference workloads for causal transformer models frequently process sequences that share common prefixes, such as system prompts, few‑shot examples, or shared queries. Standard inference engines treat each sequence independently, redundantly recomputing identical MLP activations for every copy of the shared prefix. We introduce RadixMLP, a technique that exploits the position‑wise nature of MLPs, LayerNorms, linear projections, and embeddings to eliminate this redundancy. RadixMLP dynamically maps batches to a prefix trie, gathering shared segments into a compressed representation for position‑wise computation and scattering results back only at attention boundaries. RadixMLP is stateless and operates within a single forward pass. In end‑to‑end serving benchmarks on MS~MARCO v1.1 with Qwen3 models (0.6B to 8B parameters), RadixMLP achieves 1.44‑1.59× speedups in realistic reranking workloads, with up to 5× speedups on synthetic benchmarks with longer shared prefixes. Our code is available at https://github.com/michaelfeil/radix‑mlp.

Authors:Harry Mead, Bruno Lacerda, Jakob Foerster, Nick Hawes
Title: Improving Regret Approximation for Unsupervised Dynamic Environment Generation
Abstract:
Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero‑shot performance. However, designing effective curricula remains a difficult problem, particularly in settings where small subsets of environment parameterisations result in significant increases in the complexity of the required policy. Current methods struggle with a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels, both of which are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen) to enable a denser level generator reward signal, reducing the difficulty of credit assignment and allowing for UED to scale to larger environment sizes. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), as a significantly improved metric to optimise for, that better identifies more challenging levels. We show empirically that MNA outperforms current regret approximations and when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows. We have made all our code available here: https://github.com/HarryMJMead/Dynamic‑Environment‑Generation‑for‑UED.

Authors:Zhihao Chen, Zirui Gong, Jianting Ning, Yanjun Zhang, Leo Yu Zhang
Title: Beyond Denial-of-Service: The Puppeteer's Attack for Fine-Grained Control in Ranking-Based Federated Learning
Abstract:
Federated Rank Learning (FRL) is a promising Federated Learning (FL) paradigm designed to be resilient against model poisoning attacks due to its discrete, ranking‑based update mechanism. Unlike traditional FL methods that rely on model updates, FRL leverages discrete rankings as a communication parameter between clients and the server. This approach significantly reduces communication costs and limits an adversary's ability to scale or optimize malicious updates in the continuous space, thereby enhancing its robustness. This makes FRL particularly appealing for applications where system security and data privacy are crucial, such as web‑based auction and bidding platforms. While FRL substantially reduces the attack surface, we demonstrate that it remains vulnerable to a new class of local model poisoning attack, i.e., fine‑grained control attacks. We introduce the Edge Control Attack (ECA), the first fine‑grained control attack tailored to ranking‑based FL frameworks. Unlike conventional denial‑of‑service (DoS) attacks that cause conspicuous disruptions, ECA enables an adversary to precisely degrade a competitor's accuracy to any target level while maintaining a normal‑looking convergence trajectory, thereby avoiding detection. ECA operates in two stages: (i) identifying and manipulating Ascending and Descending Edges to align the global model with the target model, and (ii) widening the selection boundary gap to stabilize the global model at the target accuracy. Extensive experiments across seven benchmark datasets and nine Byzantine‑robust aggregation rules (AGRs) show that ECA achieves fine‑grained accuracy control with an average error of only 0.224%, outperforming the baseline by up to 17x. Our findings highlight the need for stronger defenses against advanced poisoning attacks. Our code is available at: https://github.com/Chenzh0205/ECA

Authors:Po-Kai Chiu, Hung-Hsuan Chen
Title: From Volumes to Slices: Computationally Efficient Contrastive Learning for Sequential Abdominal CT Analysis
Abstract:
The requirement for expert annotations limits the effectiveness of deep learning for medical image analysis. Although 3D self‑supervised methods like volume contrast learning (VoCo) are powerful and partially address the labeling scarcity issue, their high computational cost and memory consumption are barriers. We propose 2D‑VoCo, an efficient adaptation of the VoCo framework for slice‑level self‑supervised pre‑training that learns spatial‑semantic features from unlabeled 2D CT slices via contrastive learning. The pre‑trained CNN backbone is then integrated into a CNN‑LSTM architecture to classify multi‑organ injuries. In the RSNA 2023 Abdominal Trauma dataset, 2D‑VoCo pre‑training significantly improves mAP, precision, recall, and RSNA score over training from scratch. Our framework provides a practical method to reduce the dependency on labeled data and enhance model performance in clinical CT analysis. We release the code for reproducibility. https://github.com/tkz05/2D‑VoCo‑CT‑Classifier

Authors:Andrew Crossman, Jonah Dodd, Viralam Ramamurthy Chaithanya Kumar, Riyaz Mohammed, Andrew R. Plummer, Chandra Sekharudu, Deepak Warrier, Mohammad Yekrangian
Title: Constructing Multi-label Hierarchical Classification Models for MITRE ATT&CK Text Tagging
Abstract:
MITRE ATT&CK is a cybersecurity knowledge base that organizes threat actor and cyber‑attack information into a set of tactics describing the reasons and goals threat actors have for carrying out attacks, with each tactic having a set of techniques that describe the potential methods used in these attacks. One major application of ATT&CK is the use of its tactic and technique hierarchy by security specialists as a framework for annotating cyber‑threat intelligence reports, vulnerability descriptions, threat scenarios, inter alia, to facilitate downstream analyses. To date, the tagging process is still largely done manually. In this technical note, we provide a stratified "task space" characterization of the MITRE ATT&CK text tagging task for organizing previous efforts toward automation using AIML methods, while also clarifying pathways for constructing new methods. To illustrate one of the pathways, we use the task space strata to stage‑wise construct our own multi‑label hierarchical classification models for the text tagging task via experimentation over general cyber‑threat intelligence text ‑‑ using shareable computational tools and publicly releasing the models to the security community (via https://github.com/jpmorganchase/MITRE_models). Our multi‑label hierarchical approach yields accuracy scores of roughly 94% at the tactic level, as well as accuracy scores of roughly 82% at the technique level. The models also meet or surpass state‑of‑the‑art performance while relying only on classical machine learning methods ‑‑ removing any dependence on LLMs, RAG, agents, or more complex hierarchical approaches. Moreover, we show that GPT‑4o model performance at the tactic level is significantly lower (roughly 60% accuracy) than our own approach. We also extend our baseline model to a corpus of threat scenarios for financial applications produced by subject matter experts.

Authors:Deming Chen, Vijay Ganesh, Weikai Li, Yingyan Celine Lin, Yong Liu, Subhasish Mitra, David Z. Pan, Ruchir Puri, Jason Cong, Yizhou Sun
Title: Report for NSF Workshop on AI for Electronic Design Automation
Abstract:
This report distills the discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation (EDA), held on December 10, 2024 in Vancouver alongside NeurIPS 2024. Bringing together experts across machine learning and EDA, the workshop examined how AI‑spanning large language models (LLMs), graph neural networks (GNNs), reinforcement learning (RL), neurosymbolic methods, etc.‑can facilitate EDA and shorten design turnaround. The workshop includes four themes: (1) AI for physical synthesis and design for manufacturing (DFM), discussing challenges in physical manufacturing process and potential AI applications; (2) AI for high‑level and logic‑level synthesis (HLS/LLS), covering pragma insertion, program transformation, RTL code generation, etc.; (3) AI toolbox for optimization and design, discussing frontier AI developments that could potentially be applied to EDA tasks; and (4) AI for test and verification, including LLM‑assisted verification tools, ML‑augmented SAT solving, security/reliability challenges, etc. The report recommends NSF to foster AI/EDA collaboration, invest in foundational AI for EDA, develop robust data infrastructures, promote scalable compute infrastructure, and invest in workforce development to democratize hardware design and enable next‑generation hardware systems. The workshop information can be found on the website https://ai4eda‑workshop.github.io/.

Authors:Alistair Cheong, Haolin Cong, Tyler Yang, Dustin Miao
Title: Search over Self-Edit Strategies for LLM Adaptation
Abstract:
Many LLM‑based open‑ended search systems freeze the foundation model that proposes improvements to existing solutions, which may bottleneck long‑run progress. Recent work has explored updating the proposal model at test time [arXiv:2511.23473], but the update strategy is still typically hand‑specified. Therefore, this study investigated whether an LLM can use task feedback to decide how it should update its weights. For tractability, we focused on the simpler case where there is only one round of self‑improvement, and restricted the update operator to self‑supervised next token prediction (NTP), leaving the model freedom in choosing its training data and key NTP hyperparameters. Using the Self‑Adapting Language Models (SEAL) [arXiv:2506.10943] framework as a testbed, we relaxed its fixed human template constraint and allowed the model to generate its own self‑edit templates, thereby giving it more control over its training data and hyperparameters. Two variants were studied, differing in whether template generation was conditioned on a lightweight archive of past templates. In SEAL's Single‑Passage Knowledge Incorporation setting with Qwen3‑8B on SQuAD [arXiv:1606.05250], the no‑archive variant performed comparably to the weaker "Implications" baseline, while the archive variant outperformed "Implications" and approached the strongest human‑designed "Rewrite" baseline without surpassing it. Further analysis of collapse in the model's exploration revealed that a naive archive can confer some short‑term robustness but can also accelerate homogenization, suggesting that explicit novelty pressure may be required to consistently advance beyond carefully optimized human strategies. Our code is available at https://github.com/cheongalc/search‑self‑edit‑strategies .

Authors:Leyi Zhao, Weijie Huang, Yitong Guo, Jiang Bian, Chenghong Wang, Xuhong Zhang
Title: Large Language Model-Powered Evolutionary Code Optimization on a Phylogenetic Tree
Abstract:
Optimizing scientific computing algorithms for modern GPUs is a labor‑intensive and iterative process involving repeated code modification, benchmarking, and tuning across complex hardware and software stacks. Recent work has explored large language model (LLM)‑assisted evolutionary methods for automated code optimization, but these approaches primarily rely on outcome‑based selection and random mutation, underutilizing the rich trajectory information generated during iterative optimization. We propose PhyloEvolve, an LLM‑agent system that reframes GPU‑oriented algorithm optimization as an In‑Context Reinforcement Learning (ICRL) problem. This formulation enables trajectory‑conditioned reuse of optimization experience without model retraining. PhyloEvolve integrates Algorithm Distillation and prompt‑based Decision Transformers into an iterative workflow, treating sequences of algorithm modifications and performance feedback as first‑class learning signals. To organize optimization history, we introduce a phylogenetic tree representation that captures inheritance, divergence, and recombination among algorithm variants, enabling backtracking, cross‑lineage transfer, and reproducibility. The system combines elite trajectory pooling, multi‑island parallel exploration, and containerized execution to balance exploration and exploitation across heterogeneous hardware. We evaluate PhyloEvolve on scientific computing workloads including PDE solvers, manifold learning, and spectral graph algorithms, demonstrating consistent improvements in runtime, memory efficiency, and correctness over baseline and evolutionary methods. Code is published at: https://github.com/annihi1ation/phylo_evolve

Authors:Sunghyun Kim, Seokwoo Yun, Youngseo Yun, Youngrak Lee, Sangsoo Lim
Title: MARBLE: Multi-Agent Reasoning for Bioinformatics Learning and Evolution
Abstract:
Motivation: Developing high‑performing bioinformatics models typically requires repeated cycles of hypothesis formulation, architectural redesign, and empirical validation, making progress slow, labor‑intensive, and difficult to reproduce. Although recent LLM‑based assistants can automate isolated steps, they lack performance‑grounded reasoning and stability‑aware mechanisms required for reliable, iterative model improvement in bioinformatics workflows. Results: We introduce MARBLE, an execution‑stable autonomous model refinement framework for bioinformatics models. MARBLE couples literature‑aware reference selection with structured, debate‑driven architectural reasoning among role‑specialized agents, followed by autonomous execution, evaluation, and memory updates explicitly grounded in empirical performance. Across spatial transcriptomics domain segmentation, drug‑target interaction prediction, and drug response prediction, MARBLE consistently achieves sustained performance improvements over strong baselines across multiple refinement cycles, while maintaining high execution robustness and low regression rates. Framework‑level analyses demonstrate that structured debate, balanced evidence selection, and performance‑grounded memory are critical for stable, repeatable model evolution, rather than single‑run or brittle gains. Availability: Source code, data and Supplementary Information are available at https://github.com/PRISM‑DGU/MARBLE.

Authors:Jun Liu, Leo Yu Zhang, Fengpeng Li, Isao Echizen, Jiantao Zhou
Title: Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity
Abstract:
Hard‑label black‑box settings, where only top‑1 predicted labels are observable, pose a fundamentally constrained yet practically important feedback model for understanding model behavior. A central challenge in this regime is whether meaningful gradient information can be recovered from such discrete responses. In this work, we develop a unified theoretical perspective showing that a wide range of existing sign‑flipping hard‑label attacks can be interpreted as implicitly approximating the sign of the true loss gradient. This observation reframes hard‑label attacks from heuristic search procedures into instances of gradient sign recovery under extremely limited feedback. Motivated by this first‑principles understanding, we propose a new attack framework that combines a zero‑query frequency‑domain initialization with a Pattern‑Driven Optimization (PDO) strategy. We establish theoretical guarantees demonstrating that, under mild assumptions, our initialization achieves higher expected cosine similarity to the true gradient sign compared to random baselines, while the proposed PDO procedure attains substantially lower query complexity than existing structured search approaches. We empirically validate our framework through extensive experiments on CIFAR‑10, ImageNet, and ObjectNet, covering standard and adversarially trained models, commercial APIs, and CLIP‑based models. The results show that our method consistently surpasses SOTA hard‑label attacks in both attack success rate and query efficiency, particularly in low‑query regimes. Beyond image classification, our approach generalizes effectively to corrupted data, biomedical datasets, and dense prediction tasks. Notably, it also successfully circumvents Blacklight, a SOTA stateful defense, resulting in a 0% detection rate. Our code will be released publicly soon at https://github.com/csjunjun/DPAttack.git.

Authors:Kangyu Zheng, Kai Zhang, Jiale Tan, Xuehan Chen, Yingzhou Lu, Zaixi Zhang, Lichao Sun, Marinka Zitnik, Tianfan Fu, Zhiding Liang
Title: Beyond Affinity: A Benchmark of 1D, 2D, and 3D Methods Reveals Critical Trade-offs in Structure-Based Drug Design
Abstract:
Currently, the field of structure‑based drug design is dominated by three main types of algorithms: search‑based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross‑algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand‑centric drug design methods can be used in SBDD by treating the docking function as a black‑box oracle, which is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure‑based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine strengths of different approaches while addressing their limitations. All the code that are used for benchmarking is available in https://github.com/zkysfls/2025‑sbdd‑benchmark

Authors:Anh-Tuan Mai, Cam-Van Thi Nguyen, Duc-Trong Le
Title: Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation
Abstract:
Multimodal emotion recognition in conversation (MERC) requires representations that effectively integrate signals from multiple modalities. These signals include modality‑specific cues, information shared across modalities, and interactions that emerge only when modalities are combined. In information‑theoretic terms, these correspond to \emphunique, \emphredundant, and \emphsynergistic contributions. An ideal representation should leverage all three, yet achieving such balance remains challenging. Recent advances in contrastive learning and augmentation‑based methods have made progress, but they often overlook the role of data preparation in preserving these components. In particular, applying augmentations directly to raw inputs or fused embeddings can blur the boundaries between modality‑unique and cross‑modal signals. To address this challenge, we propose a two‑phase framework \emphDivide and Refine (DnR). In the Divide phase, each modality is explicitly decomposed into uniqueness, pairwise redundancy, and synergy. In the Refine phase, tailored objectives enhance the informativeness of these components while maintaining their distinct roles. The refined representations are plug‑and‑play compatible with diverse multimodal pipelines. Extensive experiments on IEMOCAP and MELD demonstrate consistent improvements across multiple MERC backbones. These results highlight the effectiveness of explicitly dividing, refining, and recombining multimodal representations as a principled strategy for advancing emotion recognition. Our implementation is available at https://github.com/mattam301/DnR‑WACV2026

Authors:Niall McGuire, Yashar Moshfeghi
Title: Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio
Abstract:
Query formulation from internal information needs remains fundamentally challenging across all Information Retrieval paradigms due to cognitive complexity and physical impairments. Brain Passage Retrieval (BPR) addresses this by directly mapping EEG signals to passage representations without intermediate text translation. However, existing BPR research exclusively uses visual stimuli, leaving critical questions unanswered: Can auditory EEG enable effective retrieval for voice‑based interfaces and visually impaired users? Can training on combined EEG datasets from different sensory modalities improve performance despite severe data scarcity? We present the first systematic investigation of auditory EEG for BPR and evaluate cross‑sensory training benefits. Using dual encoder architectures with four pooling strategies (CLS, mean, max, multi‑vector), we conduct controlled experiments comparing auditory‑only, visual‑only, and combined training on the Alice (auditory) and Nieuwland (visual) datasets. Results demonstrate that auditory EEG consistently outperforms visual EEG, and cross‑sensory training with CLS pooling achieves substantial improvements over individual training: 31% in MRR (0.474), 43% in Hit@1 (0.314), and 28% in Hit@10 (0.858). Critically, combined auditory EEG models surpass BM25 text baselines (MRR: 0.474 vs 0.428), establishing neural queries as competitive with traditional retrieval whilst enabling accessible interfaces. These findings validate auditory neural interfaces for IR tasks and demonstrate that cross‑sensory training addresses data scarcity whilst outperforming single‑modality approaches Code: https://github.com/NiallMcguire/Audio_BPR

Authors:Cheol-Hui Lee, Hwa-Yeon Lee, Dong-Joo Kim
Title: RL-BioAug: Label-Efficient Reinforcement Learning for Self-Supervised EEG Representation Learning
Abstract:
The quality of data augmentation serves as a critical determinant for the performance of contrastive learning in EEG tasks. Although this paradigm is promising for utilizing unlabeled data, static or random augmentation strategies often fail to preserve intrinsic information due to the non‑stationarity of EEG signals where statistical properties change over time. To address this, we propose RL‑BioAug, a framework that leverages a label‑efficient reinforcement learning (RL) agent to autonomously determine optimal augmentation policies. While utilizing only a minimal fraction (10%) of labeled data to guide the agent's policy, our method enables the encoder to learn robust representations in a strictly self‑supervised manner. Experimental results demonstrate that RL‑BioAug significantly outperforms the random selection strategy, achieving substantial improvements of 9.69% and 8.80% in Macro‑F1 score on the Sleep‑EDFX and CHB‑MIT datasets, respectively. Notably, this agent mainly chose optimal strategies for each task‑‑for example, Time Masking with a 62% probability for sleep stage classification and Crop & Resize with a 77% probability for seizure detection. Our framework suggests its potential to replace conventional heuristic‑based augmentations and establish a new autonomous paradigm for data augmentation. The source code is available at https://github.com/dlcjfgmlnasa/RL‑BioAug.

Authors:Gorgi Pavlov
Title: Differentiable Logic Synthesis: Spectral Coefficient Selection via Sinkhorn-Constrained Composition
Abstract:
Learning precise Boolean logic via gradient descent remains challenging: neural networks typically converge to "fuzzy" approximations that degrade under quantization. We introduce Hierarchical Spectral Composition, a differentiable architecture that selects spectral coefficients from a frozen Boolean Fourier basis and composes them via Sinkhorn‑constrained routing with column‑sign modulation. Our approach draws on recent insights from Manifold‑Constrained Hyper‑Connections (mHC), which demonstrated that projecting routing matrices onto the Birkhoff polytope preserves identity mappings and stabilizes large‑scale training. We adapt this framework to logic synthesis, adding column‑sign modulation to enable Boolean negation ‑‑ a capability absent in standard doubly stochastic routing. We validate our approach across four phases of increasing complexity: (1) For n=2 (16 Boolean operations over 4‑dim basis), gradient descent achieves 100% accuracy with zero routing drift and zero‑loss quantization to ternary masks. (2) For n=3 (10 three‑variable operations), gradient descent achieves 76% accuracy, but exhaustive enumeration over 3^8 = 6561 configurations proves that optimal ternary masks exist for all operations (100% accuracy, 39% sparsity). (3) For n=4 (10 four‑variable operations over 16‑dim basis), spectral synthesis ‑‑ combining exact Walsh‑Hadamard coefficients, ternary quantization, and MCMC refinement with parallel tempering ‑‑ achieves 100% accuracy on all operations. This progression establishes (a) that ternary polynomial threshold representations exist for all tested functions, and (b) that finding them requires methods beyond pure gradient descent as dimensionality grows. All operations enable single‑cycle combinational logic inference at 10,959 MOps/s on GPU, demonstrating viability for hardware‑efficient neuro‑symbolic logic synthesis.

Authors:Kai Wittenmayer, Sukrut Rao, Amin Parchami-Araghi, Bernt Schiele, Jonas Fischer
Title: CFM: Language-aligned Concept Foundation Model for Vision
Abstract:
Language‑aligned vision foundation models perform strongly across diverse downstream tasks. Yet, their learned representations remain opaque, making interpreting their decision‑making difficult. Recent work decompose these representations into human‑interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. In this work, we propose CFM, a language‑aligned concept foundation model for vision that provides fine‑grained concepts, which are human‑interpretable and spatially grounded in the input image. When paired with a foundation model with strong semantic representations, we get explanations for any of its downstream tasks. Examining local co‑occurrence dependencies of concepts allows us to define concept relationships through which we improve concept naming and obtain richer explanations. On benchmark data, we show that CFM provides performance on classification, segmentation, and captioning that is competitive with opaque foundation models while providing fine‑grained, high quality concept‑based explanations. Code at https://github.com/kawi19/CFM.

Authors:Antoine Siraudin, Christopher Morris
Title: Principled Latent Diffusion for Graphs via Laplacian Autoencoders
Abstract:
Graph diffusion models achieve state‑of‑the‑art performance in graph generation but suffer from quadratic complexity in the number of nodes ‑‑ and much of their capacity is wasted modeling the absence of edges in sparse graphs. Inspired by latent diffusion in other modalities, a natural idea is to compress graphs into a low‑dimensional latent space and perform diffusion there. However, unlike images or text, graph generation requires nearly lossless reconstruction, as even a single error in decoding an adjacency matrix can render the entire sample invalid. This challenge has remained largely unaddressed. We propose LG‑Flow, a latent graph diffusion framework that directly overcomes these obstacles. A permutation‑equivariant autoencoder maps each node into a fixed‑dimensional embedding from which the full adjacency is provably recoverable, enabling near‑lossless reconstruction for both undirected graphs and DAGs. The dimensionality of this latent representation scales linearly with the number of nodes, eliminating the quadratic bottleneck and making it feasible to train larger and more expressive models. In this latent space, we train a Diffusion Transformer with flow matching, enabling efficient and expressive graph generation. Our approach achieves competitive results against state‑of‑the‑art graph diffusion models, while achieving up to 1000× speed‑up. Our code is available at https://github.com/asiraudin/LG‑Flow .

Authors:Daniel Kyselica, Jonáš Herec, Oliver Kutis, Rado Pitoňák
Title: HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection
Abstract:
Natural disaster monitoring through continuous satellite observation requires processing multi‑temporal data under strict operational constraints. This paper addresses flood detection, a critical application for hazard management, by developing an onboard change detection system that operates within the memory and computational limits of small satellites. We propose History Injection mechanism for Transformer models (HiT), that maintains historical context from previous observations while reducing data storage by over 99% of original image size. Moreover, testing on the STTORM‑CD flood dataset confirms that the HiT mechanism within the Prithvi‑tiny foundation model maintains detection accuracy compared to the bitemporal baseline. The proposed HiT‑Prithvi model achieved 43 FPS on Jetson Orin Nano, a representative onboard hardware used in nanosats. This work establishes a practical framework for satellite‑based continuous monitoring of natural disasters, supporting real‑time hazard assessment without dependency on ground‑based processing infrastructure. Architecture as well as model checkpoints is available at https://github.com/zaitra/HiT‑change‑detection

Authors:Xu Zhang, Junwei Deng, Chang Xu, Hao Li, Jiang Bian
Title: Diff-MN: Diffusion Parameterized MoE-NCDE for Continuous Time Series Generation with Irregular Observations
Abstract:
Time series generation (TSG) is widely used across domains, yet most existing methods assume regular sampling and fixed output resolutions. These assumptions are often violated in practice, where observations are irregular and sparse, while downstream applications require continuous and high‑resolution TS. Although Neural Controlled Differential Equation (NCDE) is promising for modeling irregular TS, it is constrained by a single dynamics function, tightly coupled optimization, and limited ability to adapt learned dynamics to newly generated samples from the generative model. We propose Diff‑MN, a continuous TSG framework that enhances NCDE with a Mixture‑of‑Experts (MoE) dynamics function and a decoupled architectural design for dynamics‑focused training. To further enable NCDE to generalize to newly generated samples, Diff‑MN employs a diffusion model to parameterize the NCDE temporal dynamics parameters (MoE weights), i.e., jointly learn the distribution of TS data and MoE weights. This design allows sample‑specific NCDE parameters to be generated for continuous TS generation. Experiments on ten public and synthetic datasets demonstrate that Diff‑MN consistently outperforms strong baselines on both irregular‑to‑regular and irregular‑to‑continuous TSG tasks. The code is available at the link https://github.com/microsoft/TimeCraft/tree/main/Diff‑MN.

Authors:Nickil Maveli, Antonio Vergari, Shay B. Cohen
Title: Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility
Abstract:
LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that evaluates round‑trip consistency through execution‑free, exact‑match assessment of bijection fidelity across four lossless compression algorithms. We evaluate state‑of‑the‑art Code‑LLMs under zero‑shot prompting, supervised fine‑tuning on execution traces, and iterative self‑reflection. All approaches yield only modest improvements and none closes the gap, revealing that current LLMs lack the internal coherence required for reliable bidirectional code reasoning. RTCE surfaces findings invisible to existing benchmarks: models frequently pass individual forward and backward tasks yet fail the combined round‑trip, exposing mutually inconsistent internal representations; SFT and self‑reflection saturate after one revision round, indicating they cannot repair fundamental algorithmic misunderstandings; and failures persist even on simple bijections such as RLE, suggesting that algorithmic complexity is not the sole root cause.\footnoteCode and dataset are available at https://github.com/Nickil21/round‑trip‑code‑compression.

Authors:Xue Jiang, Ge Li, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Yihong Dong
Title: KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?
Abstract:
Large language models (LLMs) excel at general programming but struggle with domain‑specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing domain‑specific code benchmarks cannot evaluate the effectiveness of domain specialization methods, which focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, lacking explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO‑BENCH, a novel benchmark designed for evaluating domain specialization methods in real‑world software development. KOCO‑BENCH contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi‑granularity evaluation tasks including domain code generation (from function‑level to project‑level with rigorous test suites) and domain knowledge understanding (via multiple‑choice Q&A). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO‑BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from knowledge corpora to solve evaluation tasks. Our evaluations reveal that KOCO‑BENCH poses significant challenges to state‑of‑the‑art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN‑LM) applied, improvements remain marginal. Best‑performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO‑BENCH, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO‑bench.

Authors:Pedro M. Gordaliza, Jaume Banus, Benoît Gérin, Maxence Wynen, Nataliia Molchanova, Jonas Richiardi, Meritxell Bach Cuadra
Title: From 100,000+ images to winning the first brain MRI foundation model challenges: Sharing lessons and models
Abstract:
Developing Foundation Models for medical image analysis is essential to overcome the unique challenges of radiological tasks. The first challenges of this kind for 3D brain MRI, SSL3D and FOMO25, were held at MICCAI 2025. Our solution ranked first in tracks of both contests. It relies on a U‑Net CNN architecture combined with strategies leveraging anatomical priors and neuroimaging domain knowledge. Notably, our models trained 1‑2 orders of magnitude faster and were 10 times smaller than competing transformer‑based approaches. Models are available here: https://github.com/jbanusco/BrainFM4Challenges.

Authors:Shiyuan Li, Yixin Liu, Yu Zheng, Mei Li, Quoc Viet Hung Nguyen, Shirui Pan
Title: OFA-MAS: One-for-All Multi-Agent System Topology Design based on Mixture-of-Experts Graph Generative Models
Abstract:
Multi‑Agent Systems (MAS) offer a powerful paradigm for solving complex problems, yet their performance is critically dependent on the design of their underlying collaboration topology. As MAS become increasingly deployed in web services (e.g., search engines), designing adaptive topologies for diverse cross‑domain user queries becomes essential. Current graph learning‑based design methodologies often adhere to a "one‑for‑one" paradigm, where a specialized model is trained for each specific task domain. This approach suffers from poor generalization to unseen domains and fails to leverage shared structural knowledge across different tasks. To address this, we propose OFA‑TAD, a one‑for‑all framework that generates adaptive collaboration graphs for any task described in natural language through a single universal model. Our approach integrates a Task‑Aware Graph State Encoder (TAGSE) that filters task‑relevant node information via sparse gating, and a Mixture‑of‑Experts (MoE) architecture that dynamically selects specialized sub‑networks to drive node and edge prediction. We employ a three‑stage training strategy: unconditional pre‑training on canonical topologies for structural priors, large‑scale conditional pre‑training on LLM‑generated datasets for task‑topology mappings, and supervised fine‑tuning on empirically validated graphs. Experiments across six diverse benchmarks show that OFA‑TAD significantly outperforms specialized one‑for‑one models, generating highly adaptive MAS topologies. Code: https://github.com/Shiy‑Li/OFA‑MAS.

Authors:Miao Xie, Siguang Chen, Chunli Lv
Title: A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits
Abstract:
Large language models (LLMs) have become powerful and widely used systems for language understanding and generation, while multi‑armed bandit (MAB) algorithms provide a principled framework for adaptive decision‑making under uncertainty. This survey explores the potential at the intersection of these two fields. As we know, it is the first survey to systematically review the bidirectional interaction between large language models and multi‑armed bandits at the component level. We highlight the bidirectional benefits: MAB algorithms address critical LLM challenges, spanning from pre‑training to retrieval‑augmented generation (RAG) and personalization. Conversely, LLMs enhance MAB systems by redefining core components such as arm definition and environment modeling, thereby improving decision‑making in sequential tasks. We analyze existing LLM‑enhanced bandit systems and bandit‑enhanced LLM systems, providing insights into their design, methodologies, and performance. Key challenges and representative findings are identified to help guide future research. An accompanying GitHub repository that indexes relevant literature is available at https://github.com/bucky1119/Awesome‑LLM‑Bandit‑Interaction.

Authors:Meng Liu, Ke Liang, Siwei Wang, Xingchen Hu, Sihang Zhou, Xinwang Liu
Title: Deep Temporal Graph Clustering: A Comprehensive Benchmark and Datasets
Abstract:
Temporal Graph Clustering (TGC) is a new task with little attention, focusing on node clustering in temporal graphs. Compared with existing static graph clustering, it can find the balance between time requirement and space requirement (Time‑Space Balance) through the interaction sequence‑based batch‑processing pattern. However, there are two major challenges that hinder the development of TGC, i.e., inapplicable clustering techniques and inapplicable datasets. To address these challenges, we propose a comprehensive benchmark, called BenchTGC. Specially, we design a BenchTGC Framework to illustrate the paradigm of temporal graph clustering and improve existing clustering techniques to fit temporal graphs. In addition, we also discuss problems with public temporal graph datasets and develop multiple datasets suitable for TGC task, called BenchTGC Datasets. According to extensive experiments, we not only verify the advantages of BenchTGC, but also demonstrate the necessity and importance of TGC task. We wish to point out that the dynamically changing and complex scenarios in real world are the foundation of temporal graph clustering. The code and data is available at: https://github.com/MGitHubL/BenchTGC.

Authors:Ishir Garg, Neel Kolhe, Andy Peng, Rohan Gopalam
Title: Fisher-Orthogonal Projected Natural Gradient Descent for Continual Learning
Abstract:
Continual learning aims to enable neural networks to acquire new knowledge on sequential tasks. However, the key challenge in such settings is to learn new tasks without catastrophically forgetting previously learned tasks. We propose the Fisher‑Orthogonal Projected Natural Gradient Descent (FOPNG) optimizer, which enforces Fisher‑orthogonal constraints on parameter updates to preserve old task performance while learning new tasks. Unlike existing methods that operate in Euclidean parameter space, FOPNG projects gradients onto the Fisher‑orthogonal complement of previous task gradients. This approach unifies natural gradient descent with orthogonal gradient methods within an information‑geometric framework. We provide theoretical analysis deriving the projected update, describe efficient and practical implementations using the diagonal Fisher, and demonstrate strong results on standard continual learning benchmarks such as Permuted‑MNIST, Split‑MNIST, Rotated‑MNIST, Split‑CIFAR10, and Split‑CIFAR100. Our code is available at https://github.com/ishirgarg/FOPNG.

Authors:Yuqi Li, Kuiye Ding, Chuanguang Yang, Szu-Yu Chen, Yingli Tian
Title: Distilling Time Series Foundation Models for Efficient Forecasting
Abstract:
Time Series foundation models (TSFMs) deliver strong forecasting performance through large‑scale pretraining, but their large parameter sizes make deployment costly. While knowledge distillation offers a natural and effective approach for model compression, techniques developed for general machine learning tasks are not directly applicable to time series forecasting due to the unique characteristics. To address this, we present DistilTS, the first distillation framework specifically designed for TSFMs. DistilTS addresses two key challenges: (1) task difficulty discrepancy, specific to forecasting, where uniform weighting makes optimization dominated by easier short‑term horizons, while long‑term horizons receive weaker supervision; and (2) architecture discrepancy, a general challenge in distillation, for which we design an alignment mechanism in the time series forecasting. To overcome these issues, DistilTS introduces horizon‑weighted objectives to balance learning across horizons, and a temporal alignment strategy that reduces architectural mismatch, enabling compact models. Experiments on multiple benchmarks demonstrate that DistilTS achieves forecasting performance comparable to full‑sized TSFMs, while reducing parameters by up to 1/150 and accelerating inference by up to 6000x. Code is available at: https://github.com/itsnotacie/DistilTS‑ICASSP2026.

Authors:Zhaochun Li, Chen Wang, Jionghao Bai, Shisheng Cui, Ge Lan, Zhou Zhao, Yue Wang
Title: Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off
Abstract:
The exploration‑exploitation (EE) trade‑off is a central challenge in reinforcement learning (RL) for large language models (LLMs). With Group Relative Policy Optimization (GRPO), training tends to be exploitation driven: entropy decreases monotonically, samples convergence, and exploration fades. Most existing fixes are sample‑centric: they seek or bonus rare samples, assuming exploration comes from novel trajectories and tokens. These heuristics depend on the "luck" of informative samples, lack principled control of the policy, and often yield limited or inconsistent gains. In this work, we are the first to introduce a distribution‑centric perspective for RL, in which exploration is always guided by a "better" target distribution, and reveal that a policy's ability to resist entropy collapse is governed by the distribution itself rather than individual samples. Building on this insight, we propose Distribution‑Centric Policy Optimization (DCPO), which reformulates entropy regulation as distribution‑level regularization. DCPO achieves controllable entropy fully on‑policy without sampling from external distributions, enabling efficient exploration while maintaining training stability. Across multiple models and seven benchmarks, DCPO improves over GRPO by about 20% on average. Overall, DCPO replaces sample‑level heuristics with distribution‑level principles, offering a theoretically grounded and flexible framework for controllable exploration and a stronger EE trade‑off. The code is available in https://github.com/597358816/DCPO.

Authors:Maab Elrashid, Anthony Deschênes, Cem Subakan, Mirco Ravanelli, Rémi Georges, Michael Morin
Title: Toward Faithful Explanations in Acoustic Anomaly Detection
Abstract:
Interpretability is essential for user trust in real‑world anomaly detection applications. However, deep learning models, despite their strong performance, often lack transparency. In this work, we study the interpretability of autoencoder‑based models for audio anomaly detection, by comparing a standard autoencoder (AE) with a mask autoencoder (MAE) in terms of detection performance and interpretability. We applied several attribution methods, including error maps, saliency maps, SmoothGrad, Integrated Gradients, GradSHAP, and Grad‑CAM. Although MAE shows a slightly lower detection, it consistently provides more faithful and temporally precise explanations, suggesting a better alignment with true anomalies. To assess the relevance of the regions highlighted by the explanation method, we propose a perturbation‑based faithfulness metric that replaces them with their reconstructions to simulate normal input. Our findings, based on experiments in a real industrial scenario, highlight the importance of incorporating interpretability into anomaly detection pipelines and show that masked training improves explanation quality without compromising performance.

Authors:Younes Bouhadjar, Maxime Fabre, Felix Schmidt, Emre Neftci
Title: Dissecting Linear Recurrent Models: How Different Gating Strategies Drive Selectivity and Generalization
Abstract:
Linear recurrent neural networks have emerged as efficient alternatives to the original Transformer's softmax attention mechanism, thanks to their highly parallelizable training and constant memory and computation requirements at inference. Iterative refinements of these models have introduced an increasing number of architectural mechanisms, leading to increased complexity and computational costs. Nevertheless, systematic direct comparisons among these models remain limited. Existing benchmark tasks are either too simplistic to reveal substantial differences or excessively resource‑intensive for experimentation. In this work, we propose a refined taxonomy of linear recurrent models and introduce SelectivBench, a set of lightweight and customizable synthetic benchmark tasks for systematically evaluating sequence models. SelectivBench specifically evaluates selectivity in sequence models at small to medium scale, such as the capacity to focus on relevant inputs while ignoring context‑based distractors. It employs rule‑based grammars to generate sequences with adjustable complexity, incorporating irregular gaps that intentionally violate transition rules. Evaluations of linear recurrent models on SelectivBench reveal performance patterns consistent with results from large‑scale language tasks. Our analysis clarifies the roles of essential architectural features: gating and rapid forgetting mechanisms facilitate recall, in‑state channel mixing is unnecessary for selectivity, but critical for generalization, and softmax attention remains dominant due to its memory capacity scaling with sequence length. Our benchmark enables targeted, efficient exploration of linear recurrent models and provides a controlled setting for studying behaviors observed in large‑scale evaluations. Code is available at https://github.com/symseqbench/selectivbench

Authors:Ruo Qi, Linhui Dai, Yusong Qin, Chaolei Yang, Yanshan Li
Title: SDCoNet: Saliency-Driven Multi-Task Collaborative Network for Remote Sensing Object Detection
Abstract:
In remote sensing images, complex backgrounds, weak object signals, and small object scales make accurate detection particularly challenging, especially under low‑quality imaging conditions. A common strategy is to integrate single‑image super‑resolution (SR) before detection; however, such serial pipelines often suffer from misaligned optimization objectives, feature redundancy, and a lack of effective interaction between SR and detection. To address these issues, we propose a Saliency‑Driven multi‑task Collaborative Network (SDCoNet) that couples SR and detection through implicit feature sharing while preserving task specificity. SDCoNet employs the swin transformer‑based shared encoder, where hierarchical window‑shifted self‑attention supports cross‑task feature collaboration and adaptively balances the trade‑off between texture refinement and semantic representation. In addition, a multi‑scale saliency prediction module produces importance scores to select key tokens, enabling focused attention on weak object regions, suppression of background clutter, and suppression of adverse features introduced by multi‑task coupling. Furthermore, a gradient routing strategy is introduced to mitigate optimization conflicts. It first stabilizes detection semantics and subsequently routes SR gradients along a detection‑oriented direction, enabling the framework to guide the SR branch to generate high‑frequency details that are explicitly beneficial for detection. Experiments on public datasets, including NWPU VHR‑10‑Split, DOTAv1.5‑Split, and HRSSD‑Split, demonstrate that the proposed method, while maintaining competitive computational efficiency, significantly outperforms existing mainstream algorithms in small object detection on low‑quality remote sensing images. Our code is available at https://github.com/qiruo‑ya/SDCoNet.

Authors:Chun-Yi Kuan, Hung-yi Lee
Title: AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
Abstract:
Recent advances in audio‑aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real‑world settings, where questions may be misleading, ill‑posed, or incompatible with the information. To address this gap, we present AQUA‑Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA‑Bench offers a rigorous measure of model reliability and promotes the development of audio‑language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio‑language understanding.

Authors:Anzhe Cheng, Shukai Duan, Shixuan Li, Chenzhong Yin, Mingxi Cheng, Shahin Nazarian, Paul Thompson, Paul Bogdan
Title: EMoE: Eigenbasis-Guided Routing for Mixture-of-Experts
Abstract:
The relentless scaling of deep learning models has led to unsustainable computational demands, positioning Mixture‑of‑Experts (MoE) architectures as a promising path towards greater efficiency. However, MoE models are plagued by two fundamental challenges: 1) a load imbalance problem known as the``rich get richer" phenomenon, where a few experts are over‑utilized, and 2) an expert homogeneity problem, where experts learn redundant representations, negating their purpose. Current solutions typically employ an auxiliary load‑balancing loss that, while mitigating imbalance, often exacerbates homogeneity by enforcing uniform routing at the expense of specialization. To resolve this, we introduce the Eigen‑Mixture‑of‑Experts (EMoE), a novel architecture that leverages a routing mechanism based on a learned orthonormal eigenbasis. EMoE projects input tokens onto this shared eigenbasis and routes them based on their alignment with the principal components of the feature space. This principled, geometric partitioning of data intrinsically promotes both balanced expert utilization and the development of diverse, specialized experts, all without the need for a conflicting auxiliary loss function. Our code is publicly available at https://github.com/Belis0811/EMoE.

Authors:Bing Hu, Yixin Li, Asma Bahamyirou, Helen Chen
Title: SynQP: A Framework and Metrics for Evaluating the Quality and Privacy Risk of Synthetic Data
Abstract:
The use of synthetic data in health applications raises privacy concerns, yet the lack of open frameworks for privacy evaluations has slowed its adoption. A major challenge is the absence of accessible benchmark datasets for evaluating privacy risks, due to difficulties in acquiring sensitive data. To address this, we introduce SynQP, an open framework for benchmarking privacy in synthetic data generation (SDG) using simulated sensitive data, ensuring that original data remains confidential. We also highlight the need for privacy metrics that fairly account for the probabilistic nature of machine learning models. As a demonstration, we use SynQP to benchmark CTGAN and propose a new identity disclosure risk metric that offers a more accurate estimation of privacy risks compared to existing approaches. Our work provides a critical tool for improving the transparency and reliability of privacy evaluations, enabling safer use of synthetic data in health‑related applications. % In our quality evaluations, non‑private models achieved near‑perfect machine‑learning efficacy \(\ge0.97\). Our privacy assessments (Table II) reveal that DP consistently lowers both identity disclosure risk (SD‑IDR) and membership‑inference attack risk (SD‑MIA), with all DP‑augmented models staying below the 0.09 regulatory threshold. Code available at https://github.com/CAN‑SYNH/SynQP

Authors:Siru Zhong, Junjie Qiu, Yangyu Wu, Yiqiu Liu, Yuanpeng He, Zhongwen Rao, Bin Yang, Chenjuan Guo, Hao Xu, Yuxuan Liang
Title: Learning to Factorize and Adapt: A Versatile Approach Toward Universal Spatio-Temporal Foundation Models
Abstract:
Spatio‑Temporal (ST) Foundation Models (STFMs) promise cross‑dataset generalization, yet joint ST pretraining is computationally expensive and grapples with the heterogeneity of domain‑specific spatial patterns. Substantially extending our preliminary conference version, we present FactoST‑v2, an enhanced factorized framework redesigned for full weight transfer and arbitrary‑length generalization. FactoST‑v2 decouples universal temporal learning from domain‑specific spatial adaptation. The first stage pretrains a minimalist encoder‑only backbone using randomized sequence masking to capture invariant temporal dynamics, enabling probabilistic quantile prediction across variable horizons. The second stage employs a streamlined adapter to rapidly inject spatial awareness via meta adaptive learning and prompting. Comprehensive evaluations across diverse domains demonstrate that FactoST‑v2 achieves state‑of‑the‑art accuracy with linear efficiency ‑ significantly outperforming existing foundation models in zero‑shot and few‑shot scenarios while rivaling domain‑specific expert baselines. This factorized paradigm offers a practical, scalable path toward truly universal STFMs. Code is available at https://github.com/CityMind‑Lab/FactoST.

Authors:Francisco Angulo de Lafuente, Vladimir Veselov, Richard Goodman
Title: Speaking to Silicon: Neural Communication with Bitcoin Mining ASICs
Abstract:
This definitive research memoria presents a comprehensive, mathematically verified paradigm for neural communication with Bitcoin mining Application‑Specific Integrated Circuits (ASICs), integrating five complementary frameworks: thermodynamic reservoir computing, hierarchical number system theory, algorithmic analysis, network latency optimization, and machine‑checked mathematical formalization. We establish that obsolete cryptocurrency mining hardware exhibits emergent computational properties enabling bidirectional information exchange between AI systems and silicon substrates. The research program demonstrates: (1) reservoir computing with NARMA‑10 Normalized Root Mean Square Error (NRMSE) of 0.8661; (2) the Thermodynamic Probability Filter (TPF) achieving 92.19% theoretical energy reduction; (3) the Virtual Block Manager achieving +25% effective hashrate; and (4) hardware universality across multiple ASIC families including Antminer S9, Lucky Miner LV06, and Goldshell LB‑Box. A significant contribution is the machine‑checked mathematical formalization using Lean 4 and Mathlib, providing unambiguous definitions, machine‑verified theorems, and reviewer‑proof claims. Key theorems proven include: independence implies zero leakage, predictor beats baseline implies non‑independence (the logical core of TPF), energy savings theoretical maximum, and Physical Unclonable Function (PUF) distinguishability witnesses. Vladimir Veselov's hierarchical number system theory explains why early‑round information contains predictive power. This work establishes a new paradigm: treating ASICs not as passive computational substrates but as active conversational partners whose thermodynamic state encodes exploitable computational information.

Authors:Jingchu Wang, Bingbing Xu, Yige Yuan, Bin Xie, Xiaoqian Sun, Huawei Shen
Title: R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning
Abstract:
Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. In this paper, to address the problem, we propose R^2PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout‑Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.4% on MATH‑500 and 1.3% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at https://github.com/RRPO‑ARR/Code.

Authors:Fatih Maulana
Title: Impact of Circuit Depth versus Qubit Count on Variational Quantum Classifiers for Higgs Boson Signal Detection
Abstract:
High‑Energy Physics (HEP) experiments, such as those at the Large Hadron Collider (LHC), generate massive datasets that challenge classical computational limits. Quantum Machine Learning (QML) offers a potential advantage in processing high‑dimensional data; however, finding the optimal architecture for current Noisy Intermediate‑Scale Quantum (NISQ) devices remains an open challenge. This study investigates the performance of Variational Quantum Classifiers (VQC) in detecting Higgs Boson signals using the ATLAS Higgs Boson Machine Learning Challenge 2014 experiment dataset. We implemented a dimensionality reduction pipeline using Principal Component Analysis (PCA) to map 30 physical features into 4‑qubit and 8‑qubit latent spaces. We benchmarked three configurations: (A) a shallow 4‑qubit circuit, (B) a deep 4‑qubit circuit with increased entanglement layers, and (C) an expanded 8‑qubit circuit. Experimental results demonstrate that increasing circuit depth significantly improves performance, yielding the highest accuracy of 56.2% (Configuration B), compared to a baseline of 51.9%. Conversely, simply scaling to 8 qubits resulted in a performance degradation to 50.6% due to optimization challenges associated with Barren Plateaus in the larger Hilbert space. These findings suggest that for near‑term quantum hardware, prioritizing circuit depth and entanglement capability is more critical than increasing qubit count for effective anomaly detection in HEP data.

Authors:Jinshi Liu, Pan Liu
Title: A Confidence-Variance Theory for Pseudo-Label Selection in Semi-Supervised Learning
Abstract:
Most pseudo‑label selection strategies in semi‑supervised learning rely on fixed confidence thresholds, implicitly assuming that prediction confidence reliably indicates correctness. In practice, deep networks are often overconfident: high‑confidence predictions can still be wrong, while informative low‑confidence samples near decision boundaries are discarded. This paper introduces a Confidence‑Variance (CoVar) theory framework that provides a principled joint reliability criterion for pseudo‑label selection. Starting from the entropy minimization principle, we derive a reliability measure that combines maximum confidence (MC) with residual‑class variance (RCV), which characterizes how probability mass is distributed over non‑maximum classes. The derivation shows that reliable pseudo‑labels should have both high MC and low RCV, and that the influence of RCV increases as confidence grows, thereby correcting overconfident but unstable predictions. From this perspective, we cast pseudo‑label selection as a spectral relaxation problem that maximizes separability in a confidence‑variance feature space, and design a threshold‑free selection mechanism to distinguish high‑ from low‑reliability predictions. We integrate CoVar as a plug‑in module into representative semi‑supervised semantic segmentation and image classification methods. Across PASCAL VOC 2012, Cityscapes, CIFAR‑10, and Mini‑ImageNet with varying label ratios and backbones, it consistently improves over strong baselines, indicating that combining confidence with residual‑class variance provides a more reliable basis for pseudo‑label selection than fixed confidence thresholds. (Code: https://github.com/ljs11528/CoVar_Pseudo_Label_Selection.git)

Authors:Arnav S. Sonavane
Title: Domain-Specific Self-Supervised Pre-training for Agricultural Disease Classification: A Hierarchical Vision Transformer Study
Abstract:
We investigate the impact of domain‑specific self‑supervised pre‑training on agricultural disease classification using hierarchical vision transformers. Our key finding is that SimCLR pre‑training on just 3,000 unlabeled agricultural images provides a +4.57% accuracy improvement‑‑exceeding the +3.70% gain from hierarchical architecture design. Critically, we show this SSL benefit is architecture‑agnostic: applying the same pre‑training to Swin‑Base yields +4.08%, to ViT‑Base +4.20%, confirming practitioners should prioritize domain data collection over architectural choices. Using HierarchicalViT (HVT), a Swin‑style hierarchical transformer, we evaluate on three datasets: Cotton Leaf Disease (7 classes, 90.24%), PlantVillage (38 classes, 96.3%), and PlantDoc (27 classes, 87.1%). At matched parameter counts, HVT‑Base (78M) achieves 88.91% vs. Swin‑Base (88M) at 87.23%, a +1.68% improvement. For deployment reliability, we report calibration analysis showing HVT achieves 3.56% ECE (1.52% after temperature scaling). Code: https://github.com/w2sg‑arnav/HierarchicalViT

Authors:Farzana Islam Adiba, Varsha Danduri, Fahmida Liza Piya, Ali Abbasi, Mehak Gupta, Rahmatollah Beheshti
Title: A Multimodal Data Processing Pipeline for MIMIC-IV Dataset
Abstract:
The MIMIC‑IV dataset is a large, publicly available electronic health record (EHR) resource widely used for clinical machine learning research. It comprises multiple modalities, including structured data, clinical notes, waveforms, and imaging data. Working with these disjointed modalities requires an extensive manual effort to preprocess and align them for downstream analysis. While several pipelines for MIMIC‑IV data extraction are available, they target a small subset of modalities or do not fully support arbitrary downstream applications. In this work, we greatly expand our prior popular unimodal pipeline and present a comprehensive and customizable multimodal pipeline that can significantly reduce multimodal processing time and enhance the reproducibility of MIMIC‑based studies. Our pipeline systematically integrates the listed modalities, enabling automated cohort selection, temporal alignment across modalities, and standardized multimodal output formats suitable for arbitrary static and time‑series downstream applications. We release the code, a simple UI, and a Python package for selective integration (with embedding) at https://github.com/healthylaife/MIMIC‑IV‑Data‑Pipeline.

Authors:Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, Jakob Engel
Title: ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Abstract:
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well‑segmented inputs. Such conditions are rarely met in real‑world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off‑the‑shelf visual‑inertial SLAM, 3D detection algorithms, and vision‑language models to extract, for each object, a set of sparse SLAM points, posed multi‑view images, and machine‑generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high‑fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on‑the‑fly compositional augmentations, a curriculum training scheme spanning object‑ and scene‑level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in‑the‑wild objects across 7 real‑world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.

Authors:Dimitar Nedanovski, Svetoslav Nenov, Dimitar Pilev
Title: On the Probability of First Success in Differential Evolution: Hazard Identities and Tail Bounds
Abstract:
We study first‑hitting times in Differential Evolution (DE) through a conditional hazard frame work. Instead of analyzing convergence via Markov‑chain transition kernels or drift arguments, we ex press the survival probability of a measurable target set A as a product of conditional first‑hit probabilities (hazards) p_t=\Prob(E_t\mid\mathcal F_t‑1). This yields distribution‑free identities for survival and explicit tail bounds whenever deterministic lower bounds on the hazard hold on the survival event. For the L‑SHADE algorithm with current‑to‑pbest/1 mutation, we construct a checkable algorithmic witness event \mathcal L_t under which the conditional hazard admits an explicit lower bound depending only on sampling rules, population size, and crossover statistics. This separates theoretical constants from empirical event frequencies and explains why worst‑case constant‑hazard bounds are typically conservative. We complement the theory with a Kaplan‑‑Meier survival analysis on the CEC2017 benchmark suite . Across functions and budgets, we identify three distinct empirical regimes: (i) strongly clustered success, where hitting times concentrate in short bursts; (ii) approximately geometric tails, where a constant‑hazard model is accurate; and (iii) intractable cases with no observed hits within the evaluation horizon. The results show that while constant‑hazard bounds provide valid tail envelopes, the practical behavior of L‑SHADE is governed by burst‑like transitions rather than homogeneous per‑generati on success probabilities.

Authors:Oishee Bintey Hoque, Nibir Chandra Mandal, Kyle Luong, Amanda Wilson, Samarth Swarup, Madhav Marathe, Abhijin Adiga
Title: PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs
Abstract:
Large‑scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure‑first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (i) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain‑tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters component‑specific criteria; (ii) extracts structured descriptors (e.g., counts, areas, orientations, and spatial relations) and fuses them with deep visual features using a lightweight spatial cross‑attention classifier; and (iii) outputs both CAFO type predictions and mask‑level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state‑of‑the‑art performance, with Swin‑B+PRISM‑CAFO surpassing the best performing baseline by up to 15%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient‑‑activation analyses that quantify the impact of domain priors and show how specific infrastructure (e.g., barns, lagoons) shapes classification decisions. We release code, infrastructure masks, and descriptors to support transparent, scalable monitoring of livestock infrastructure, enabling risk modeling, change detection, and targeted regulatory action. Github: https://github.com/Nibir088/PRISM‑CAFO.

Authors:Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Damien Garreau, Pierre-Alexandre Mattei
Title: When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models
Abstract:
Diffusion models now generate high‑quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well‑known way to improve supervised models, its application to unconditional score‑based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score‑matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules using Deep Ensembles, Monte Carlo Dropout, on CIFAR‑10 and FFHQ. We attempt to explain this discrepancy by investigating possible explanations, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance).

Authors:Xiaojie Gu, Guangxu Chen, Yuheng Yang, Jingxin Han, Andi Zhang
Title: Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models
Abstract:
Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE

Authors:Mark Eastwood, Thomas McKee, Zedong Hu, Sabine Tejpar, Fayyaz Minhas
Title: Beer-Lambert Autoencoder for Unsupervised Stain Representation Learning and Deconvolution in Multi-immunohistochemical Brightfield Histology Images
Abstract:
Separating the contributions of individual chromogenic stains in RGB histology whole slide images (WSIs) is essential for stain normalization, quantitative assessment of marker expression, and cell‑level readouts in immunohistochemistry (IHC). Classical Beer‑Lambert (BL) color deconvolution is well‑established for two‑ or three‑stain settings, but becomes under‑determined and unstable for multiplex IHC (mIHC) with K>3 chromogens. We present a simple, data‑driven encoder‑decoder architecture that learns cohort‑specific stain characteristics for mIHC RGB WSIs and yields crisp, well‑separated per‑stain concentration maps. The encoder is a compact U‑Net that predicts K nonnegative concentration channels; the decoder is a differentiable BL forward model with a learnable stain matrix initialized from typical chromogen hues. Training is unsupervised with a perceptual reconstruction objective augmented by loss terms that discourage unnecessary stain mixing. On a colorectal mIHC panel comprising 5 stains (H, CDX2, MUC2, MUC5, CD8) we show excellent RGB reconstruction, and significantly reduced inter‑channel bleed‑through compared with matrix‑based deconvolution. Code and model are available at https://github.com/measty/StainQuant.git.

Authors:Lorenzo Tomada, Federico Pichi, Gianluigi Rozza
Title: Latent Dynamics Graph Convolutional Networks for model order reduction of parameterized time-dependent PDEs
Abstract:
Graph Neural Networks (GNNs) are emerging as powerful tools for nonlinear Model Order Reduction (MOR) of time‑dependent parameterized Partial Differential Equations (PDEs). However, existing methodologies struggle to combine geometric inductive biases with interpretable latent behavior, overlooking dynamics‑driven features or disregarding spatial information. In this work, we address this gap by introducing Latent Dynamics Graph Convolutional Network (LD‑GCN), a purely data‑driven, encoder‑free architecture that learns a global, low‑dimensional representation of dynamical systems conditioned on external inputs and parameters. The temporal evolution is modeled in the latent space and advanced through time‑stepping, allowing for time‑extrapolation, and the trajectories are consistently decoded onto geometrically parameterized domains using a GNN. Our framework enhances interpretability by enabling the analysis of the reduced dynamics and supporting zero‑shot prediction through latent interpolation. The methodology is mathematically validated via a universal approximation theorem for encoder‑free architectures, and numerically tested on complex computational mechanics problems involving physical and geometric parameters, including the detection of bifurcating phenomena for Navier‑Stokes equations. Code availability: https://github.com/lorenzotomada/ld‑gcn‑rom

Authors:Yuling Shi, Maolin Sun, Zijun Liu, Mo Yang, Yixiong Fang, Tianran Sun, Xiaodong Gu
Title: Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering
Abstract:
Retrieval‑Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi‑hop question answering (QA). For multi‑hop QA tasks, current iterative approaches predominantly rely on LLMs to self‑guide and plan multi‑step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT‑RAG), a novel hierarchical framework for complex multi‑hop QA. RT‑RAG systematically decomposes multi‑hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus‑based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom‑up traversal strategy employs iterative query rewriting and refinement to collect high‑quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT‑RAG substantially outperforms state‑of‑the‑art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT‑RAG in complex multi‑hop QA.

Authors:Pascal Schlachter, Bin Yang
Title: GMM-COMET: Continual Source-Free Universal Domain Adaptation via a Mean Teacher and Gaussian Mixture Model-Based Pseudo-Labeling
Abstract:
Unsupervised domain adaptation tackles the problem that domain shifts between training and test data impair the performance of neural networks in many real‑world applications. Thereby, in realistic scenarios, the source data may no longer be available during adaptation, and the label space of the target domain may differ from the source label space. This setting, known as source‑free universal domain adaptation (SF‑UniDA), has recently gained attention, but all existing approaches only assume a single domain shift from source to target. In this work, we present the first study on continual SF‑UniDA, where the model must adapt sequentially to a stream of multiple different unlabeled target domains. Building upon our previous methods for online SF‑UniDA, we combine their key ideas by integrating Gaussian mixture model‑based pseudo‑labeling within a mean teacher framework for improved stability over long adaptation sequences. Additionally, we introduce consistency losses for further robustness. The resulting method GMM‑COMET provides a strong first baseline for continual SF‑UniDA and is the only approach in our experiments to consistently improve upon the source‑only model across all evaluated scenarios. Our code is available at https://github.com/pascalschlachter/GMM‑COMET.

Authors:Yuki Nakamura, Shingo Takemoto, Shunsuke Ono
Title: Comprehensive Robust Dynamic Mode Decomposition from Mode Extraction to Dimensional Reduction
Abstract:
We propose Comprehensive Robust Dynamic Mode Decomposition (CR‑DMD), a novel framework that robustifies the entire DMD process ‑ from mode extraction to dimensional reduction ‑ against mixed noise. Although standard DMD widely used for uncovering spatio‑temporal patterns and constructing low‑dimensional models of dynamical systems, it suffers from significant performance degradation under noise due to its reliance on least‑squares estimation for computing the linear time evolution operator. Existing robust variants typically modify the least‑squares formulation, but they remain unstable and fail to ensure faithful low‑dimensional representations. First, we introduce a convex optimization‑based preprocessing method designed to effectively remove mixed noise, achieving accurate and stable mode extraction. Second, we propose a new convex formulation for dimensional reduction that explicitly links the robustly extracted modes to the original noisy observations, constructing a faithful representation of the original data via a sparse weighted sum of the modes. Both stages are efficiently solved by a preconditioned primal‑dual splitting method. Experiments on fluid dynamics datasets demonstrate that CR‑DMD consistently outperforms state‑of‑the‑art robust DMD methods in terms of mode accuracy and fidelity of low‑dimensional representations under noisy conditions.

Authors:Lecheng Yan, Ruizhe Li, Guanhua Chen, Qing Li, Jiahui Geng, Wenxi Li, Vincent Wang, Chris Lee
Title: Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer‑token perplexity drops while prompt‑side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor‑Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18‑20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering‑artificially amplifying or suppressing contamination‑driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR‑tuned models. Code is available at https://github.com/idwts/How‑RLVR‑Activates‑Memorization‑Shortcuts.

Authors:Mohsin Hasan, Viktor Ohanesian, Artem Gazizov, Yoshua Bengio, Alán Aspuru-Guzik, Roberto Bondesan, Marta Skreta, Kirill Neklyudov
Title: Discrete Feynman-Kac Correctors
Abstract:
Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non‑sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman‑Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine‑tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward‑tilted protein sequence generation.

Authors:Mark Kashirskiy, Ilya Makarov
Title: SuS: Strategy-aware Surprise for Intrinsic Exploration
Abstract:
We propose Strategy‑aware Surprise (SuS), a novel intrinsic motivation framework that uses pre‑post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity‑driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent's current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation, validating the synergistic nature of our approach. SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baseline methods, while maintaining higher strategy diversity throughout training.

Authors:Yuxuan Lou, Kai Yang, Yang You
Title: MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
Abstract:
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality‑Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality‑appropriate experts based on input type. MAMoE simultaneously enhances modality‑specific learning and cross‑modal understanding through two complementary components: modality‑specific expert groups that capture domain‑specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post‑training on ASR and TTS datasets, followed by fine‑tuning with a carefully curated speech‑text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open‑source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality‑specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open‑source speech‑text LLM built on a Mixture of Experts architecture. \footnoteWe release MoST model, training code, inference code, and training data at https://github.com/NUS‑HPC‑AI‑Lab/MoST

Authors:Piyush Singh Pasi
Title: Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text
Abstract:
Multimodal models excel in English, supported by abundant image‑text and audio‑text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers‑‑using English text alone‑‑to map multilingual text embeddings into multimodal space. Despite its simplicity, M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero‑shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text‑to‑Image retrieval. Qualitative t‑SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image‑text retrieval, M2M demonstrates robustness across datasets and tasks, extending to Audio‑Text retrieval and Text‑to‑Image generation. We release code and checkpoints (https://github.com/piyushsinghpasi/M2M) along with multilingual evaluation datasets: MSCOCO Multilingual 30K (https://huggingface.co/datasets/piyushsinghpasi/mscoco‑multilingual‑30k), AudioCaps Multilingual (https://huggingface.co/datasets/piyushsinghpasi/audiocaps‑multilingual), and Clotho Multilingual (https://huggingface.co/datasets/piyushsinghpasi/clotho‑multilingual).

Authors:Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen
Title: V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
Abstract:
Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision‑language models (VLMs). However, state‑of‑the‑art approaches rely heavily on large‑scale human‑annotated datasets, which are costly and time‑consuming to acquire. To overcome this limitation, we introduce V‑Zero, a general post‑training framework that facilitates self‑improvement using exclusively unlabeled images. V‑Zero establishes a co‑evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high‑quality, challenging questions by leveraging a dual‑track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo‑labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V‑Zero achieves consistent performance gains on Qwen2.5‑VL‑7B‑Instruct, improving visual mathematical reasoning by +1.7 and general vision‑centric by +2.6, demonstrating the potential of self‑improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V‑Zero

Authors:Hung Vinh Tran, Tong Chen, Hechuan Wen, Quoc Viet Hung Nguyen, Bin Cui, Hongzhi Yin
Title: Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection
Abstract:
Content‑based recommendation systems (CRSs) utilize content features to predict user‑item interactions, serving as essential tools for helping users navigate information‑rich web services. However, ensuring the effectiveness of CRSs requires large‑scale and even continuous model training to accommodate diverse user preferences, resulting in significant computational costs and resource demands. A promising approach to this challenge is coreset selection, which identifies a small but representative subset of data samples that preserves model quality while reducing training overhead. Yet, the selected coreset is vulnerable to the pervasive noise in user‑item interactions, particularly when it is minimally sized. To this end, we propose Noise‑aware Coreset Selection (NaCS), a specialized framework for CRSs. NaCS constructs coresets through submodular optimization based on training gradients, while simultaneously correcting noisy labels using a progressively trained model. Meanwhile, we refine the selected coreset by filtering out low‑confidence samples through uncertainty quantification, thereby avoid training with unreliable interactions. Through extensive experiments, we show that NaCS produces higher‑quality coresets for CRSs while achieving better efficiency than existing coreset selection techniques. Notably, NaCS recovers 93‑95% of full‑dataset training performance using merely 1% of the training data. The source code is available at \hrefhttps://github.com/chenxing1999/nacshttps://github.com/chenxing1999/nacs.

Authors:Peter Jemley
Title: Continuous-Depth Transformers with Learned Control Dynamics
Abstract:
We present a hybrid transformer architecture that replaces discrete middle layers with a continuous‑depth Neural Ordinary Differential Equation (ODE) block, enabling inference‑time control over generation attributes via a learned steering signal. Unlike standard transformers that process representations through fixed discrete layers, our approach treats depth as a continuous variable governed by a learned vector field F_θ(H, τ, u), where u is a low‑dimensional control signal injected via explicit concatenation. We validate the architecture through four experiments: (1) gradient flow stability with zero exploding/vanishing gradient events, (2) semantic steering achieving 98%/88% accuracy for positive/negative sentiment control, (3) continuous interpolation validated by a negligible 0.068% trajectory divergence between fixed and adaptive solvers, and (4) efficiency benchmarking demonstrating latency parity with standard discrete baselines. Additionally, we show that adaptive ODE solvers reveal geometric structure in the learned dynamics: the control signal partitions the vector field into distinct dynamical regimes with different curvature characteristics. The adjoint method enables O(1) memory training regardless of integration depth. Our results demonstrate that continuous‑depth dynamics with learned control signals provide a viable, efficient mechanism for steerable language generation.

Authors:Jack Wilkie, Hanan Hindy, Craig Michie, Christos Tachtatzis, James Irvine, Robert Atkinson
Title: A Novel Contrastive Loss for Zero-Day Network Intrusion Detection
Abstract:
Machine learning has achieved state‑of‑the‑art results in network intrusion detection; however, its performance significantly degrades when confronted by a new attack class ‑‑ a zero‑day attack. In simple terms, classical machine learning‑based approaches are adept at identifying attack classes on which they have been previously trained, but struggle with those not included in their training data. One approach to addressing this shortcoming is to utilise anomaly detectors which train exclusively on benign data with the goal of generalising to all attack classes ‑‑ both known and zero‑day. However, this comes at the expense of a prohibitively high false positive rate. This work proposes a novel contrastive loss function which is able to maintain the advantages of other contrastive learning‑based approaches (robustness to imbalanced data) but can also generalise to zero‑day attacks. Unlike anomaly detectors, this model learns the distributions of benign traffic using both benign and known malign samples, i.e. other well‑known attack classes (not including the zero‑day class), and consequently, achieves significant performance improvements. The proposed approach is experimentally verified on the Lycos2017 dataset where it achieves an AUROC improvement of .000065 and .060883 over previous models in known and zero‑day attack detection, respectively. Finally, the proposed method is extended to open‑set recognition achieving OpenAUC improvements of .170883 over existing approaches.

Authors:Nguyen Minh Phuong, Dang Huu Tien, Naoya Inoue
Title: Improving Chain-of-Thought for Logical Reasoning via Attention-Aware Intervention
Abstract:
Modern logical reasoning with LLMs primarily relies on employing complex interactive frameworks that decompose the reasoning process into subtasks solved through carefully designed prompts or requiring external resources (e.g., symbolic solvers) to exploit their strong logical structures. While interactive approaches introduce additional overhead or depend on external components, which limit their scalability. In this work, we introduce a non‑interactive, end‑to‑end framework for reasoning tasks, enabling reasoning to emerge within the model itself‑improving generalization while preserving analyzability without any external resources. We show that introducing structural information into the few‑shot prompt activates a subset of attention heads that patterns aligned with logical reasoning operators. Building on this insight, we propose Attention‑Aware Intervention (AAI), an inference‑time intervention method that reweights attention scores across selected heads identified by their logical patterns. AAI offers an efficient way to steer the model's reasoning toward leveraging prior knowledge through attention modulation. Extensive experiments show that AAI enhances logical reasoning performance across diverse benchmarks, and model architectures, while incurring negligible additional computational overhead. Code is available at https://github.com/phuongnm94/aai_for_logical_reasoning.

Authors:Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
Title: Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Abstract:
Vision‑Language‑Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain‑of‑thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast‑ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast‑ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference‑guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning‑enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast‑ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state‑of‑the‑art reasoning VLAs, while maintaining effective long‑horizon planning, few‑shot adaptation, and failure recovery.

Authors:Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal
Title: Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
Abstract:
Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground‑truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high‑level task descriptions by generator LLMs. We evaluate query‑answer routers (using both queries and labels) and query‑only routers across four diverse benchmarks and 12 models, finding that query‑answer routers degrade faster than query‑only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query‑only router that estimates model correctness through consensus voting and identifies model‑specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query‑answer router by 4.6% absolute accuracy when trained on weak generator data.

Authors:Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, Chung-Piaw Teo
Title: LLM for Large-Scale Optimization Model Auto-Formulation: Bridging Flexibility and Standardization via Agentic Workflow
Abstract:
Large‑scale optimization is a key backbone of modern business decision‑making. However, building these models is often labor‑intensive and time‑consuming. We address this by proposing LEAN‑LLM‑OPT, a LightwEight AgeNtic workflow construction framework for LLM‑assisted large‑scale OPTimization auto‑formulation. LEAN‑LLM‑OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step‑by‑step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. The agentic workflow leverages common modeling practices to standardize the modeling process into a sequence of structured sub‑tasks, offloading mechanical data‑handling operations to auxiliary tools. This reduces the LLM's burden in planning and data handling, allowing us to exploit its flexibility to address unstructured components. Extensive simulations show that LEAN‑LLM‑OPT, instantiated with GPT‑4.1 and the open source gpt‑oss‑20B, achieves strong performance on large‑scale optimization modeling tasks and is competitive with state‑of‑the‑art approaches. In addition, in a Singapore Airlines choice‑based revenue management use case, LEAN‑LLM‑OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large‑Scale‑OR and Air‑NRM, the first comprehensive benchmarks for large‑scale optimization auto‑formulation. The code and data of this work is available at https://github.com/CoraLiang01/lean‑llm‑opt.

Authors:Ralf Römer, Yi Zhang, Angela P. Schoellig
Title: CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion
Abstract:
To teach robots complex manipulation tasks, it is now a common practice to fine‑tune a pre‑trained vision‑language‑action model (VLA) on task‑specific data. However, since this recipe updates existing representations, it is unsuitable for long‑term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter‑efficient framework for exemplar‑free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected feedforward layers and autonomously expands the model only where necessary when learning a new task, guided by layer‑wise feature similarity. During deployment, an autoencoder‑based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar‑based methods. Code and data are available at https://tum‑lsy.github.io/clare.

Authors:Ritabrata Chakraborty, Hrishit Mitra, Shivakumara Palaiahnakote, Umapada Pal
Title: Towards Robust Cross-Dataset Object Detection Generalization under Domain Specificity
Abstract:
Object detectors often perform well in‑distribution, yet degrade sharply on a different benchmark. We study cross‑dataset object detection (CD‑OD) through a lens of setting specificity. We group benchmarks into setting‑agnostic datasets with diverse everyday scenes and setting‑specific datasets tied to a narrow environment, and evaluate a standard detector family across all train‑‑test pairs. This reveals a clear structure in CD‑OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open‑label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed‑label transfer with an open‑label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open‑label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near‑misses supported by the image evidence. Overall, we provide a principled characterization of CD‑OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at \href[https://github.com/Ritabrata04/cdod‑icpr.githttps://github.com/Ritabrata04/cdod‑icpr.

Authors:Renqiang Luo, Yongshuai Yang, Huafei Huang, Qing Qing, Mingliang Hou, Ziqi Xu, Yi Yu, Jingjing Zhou, Feng Xia
Title: FairGU: Fairness-aware Graph Unlearning in Social Networks
Abstract:
Graph unlearning has emerged as a critical mechanism for supporting sustainable and privacy‑preserving social networks, enabling models to remove the influence of deleted nodes and thereby better safeguard user information. However, we observe that existing graph unlearning techniques insufficiently protect sensitive attributes, often leading to degraded algorithmic fairness compared with traditional graph learning methods. To address this gap, we introduce FairGU, a fairness‑aware graph unlearning framework designed to preserve both utility and fairness during the unlearning process. FairGU integrates a dedicated fairness‑aware module with effective data protection strategies, ensuring that sensitive attributes are neither inadvertently amplified nor structurally exposed when nodes are removed. Through extensive experiments on multiple real‑world datasets, we demonstrate that FairGU consistently outperforms state‑of‑the‑art graph unlearning methods and fairness‑enhanced graph learning baselines in terms of both accuracy and fairness metrics. Our findings highlight a previously overlooked risk in current unlearning practices and establish FairGU as a robust and equitable solution for the next generation of socially sustainable networked systems. The codes are available at https://github.com/LuoRenqiang/FairGU.

Authors:Maria Sdraka, Dimitrios Michail, Ioannis Papoutsis
Title: Magnifying change: Rapid burn scar mapping with multi-resolution, multi-source satellite imagery
Abstract:
Delineating wildfire affected areas using satellite imagery remains challenging due to irregular and spatially heterogeneous spectral changes across the electromagnetic spectrum. While recent deep learning approaches achieve high accuracy when high‑resolution multispectral data are available, their applicability in operational settings, where a quick delineation of the burn scar shortly after a wildfire incident is required, is limited by the trade‑off between spatial resolution and temporal revisit frequency of current satellite systems. To address this limitation, we propose a novel deep learning model, namely BAM‑MRCD, which employs multi‑resolution, multi‑source satellite imagery (MODIS and Sentinel‑2) for the timely production of detailed burnt area maps with high spatial and temporal resolution. Our model manages to detect even small scale wildfires with high accuracy, surpassing similar change detection models as well as solid baselines. All data and code are available in the GitHub repository: https://github.com/Orion‑AI‑Lab/BAM‑MRCD.

Authors:Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang
Title: GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization
Abstract:
The prevailing post‑training paradigm for Large Reasoning Models (LRMs)‑‑Supervised Fine‑Tuning (SFT) followed by Reinforcement Learning (RL)‑‑suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post‑training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero‑temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite‑temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post‑training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post‑training. Our code is available at https://github.com/zzy1127/GIFT.

Authors:Zhixiang Liang, Beichen Huang, Zheng Wang, Minjia Zhang
Title: Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling
Abstract:
Large Language Models (LLMs) can enhance reasoning capabilities through test‑time scaling by generating multiple traces. However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end‑to‑end latency. Prior work on accelerating this process has relied on similarity‑based or confidence‑based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step‑level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory‑aware pruning strategy that triggers pruning as the GPU memory is saturated by KV cache to reduce end‑to‑end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end‑to‑end inference latency by 45%‑70% on average compared to self‑consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing‑System‑AI‑Lab/STEP

Authors:Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye
Title: Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
Abstract:
In this report, we introduce DASD‑4B‑Thinking, a lightweight yet highly capable, fully open‑source reasoning model. It achieves SOTA performance among open‑source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation ‑‑ even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher‑generated responses, also known as sequence‑level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself ‑‑ enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence‑level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher‑forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher‑student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence‑level distillation training pipeline. Remarkably, DASD‑4B‑Thinking obtains competitive results using only 448K training samples ‑‑ an order of magnitude fewer than those employed by most existing open‑source efforts. To support community research, we publicly release our models and the training dataset.

Authors:Jiahao Qin, Yiwen Wang
Title: Learning Domain-Invariant Representations for Cross-Domain Image Registration via Scene-Appearance Disentanglement
Abstract:
Image registration under domain shift remains a fundamental challenge in computer vision and medical imaging: when source and target images exhibit systematic intensity differences, the brightness constancy assumption underlying conventional registration methods is violated, rendering correspondence estimation ill‑posed. We propose SAR‑Net, a unified framework that addresses this challenge through principled scene‑appearance disentanglement. Our key insight is that observed images can be decomposed into domain‑invariant scene representations and domain‑specific appearance codes, enabling registration via re‑rendering rather than direct intensity matching. We establish theoretical conditions under which this decomposition enables consistent cross‑domain alignment (Proposition 1) and prove that our scene consistency loss provides a sufficient condition for geometric correspondence in the shared latent space (Proposition 2). Empirically, we validate SAR‑Net on the ANHIR (Automatic Non‑rigid Histological Image Registration) challenge benchmark, where multi‑stain histopathology images exhibit coupled domain shift from different staining protocols and geometric distortion from tissue preparation. Our method achieves a median relative Target Registration Error (rTRE) of 0.25%, outperforming the state‑of‑the‑art MEVIS method (0.27% rTRE) by 7.4%, with robustness of 99.1%. Code is available at https://github.com/D‑ST‑Sword/SAR‑NET .

Authors:Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu
Title: Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
Abstract:
Large language models often solve complex reasoning tasks more effectively with Chain‑of‑Thought (CoT), but at the cost of long, low‑bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on‑policy reinforcement learning (RL). Importantly, Multiplex Thinking is self‑adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR‑Penn/Multiplex‑Thinking.

Authors:Vikas Dwivedi, Monica Sigovan, Bruno Sixou
Title: Soft Partition-based KAPI-ELM for Multi-Scale PDEs
Abstract:
Physics‑informed machine learning holds great promise for solving differential equations, yet existing methods struggle with highly oscillatory, multiscale, or singularly perturbed PDEs due to spectral bias, costly backpropagation, and manually tuned kernel or Fourier frequencies. This work introduces a soft partition‑‑based Kernel‑Adaptive Physics‑Informed Extreme Learning Machine (KAPI‑ELM), a deterministic low‑dimensional parameterization in which smooth partition lengths jointly control collocation centers and Gaussian kernel widths, enabling continuous coarse‑to‑fine resolution without Fourier features, random sampling, or hard domain interfaces. A signed‑distance‑based weighting further stabilizes least‑squares learning on irregular geometries. Across eight benchmarks‑‑including oscillatory ODEs, high‑frequency Poisson equations, irregular‑shaped domains, and stiff singularly perturbed convection‑diffusion problems‑the proposed method matches or exceeds the accuracy of state‑of‑the‑art Physics‑Informed Neural Network (PINN) and Theory of Functional Connections (TFC) variants while using only a single linear solve. Although demonstrated on steady linear PDEs, the results show that soft‑partition kernel adaptation provides a fast, architecture‑free approach for multiscale PDEs with broad potential for future physics‑informed modeling. For reproducibility, the reference codes are available at https://github.com/vikas‑dwivedi‑2022/soft_kapi

Authors:Miaomiao Cai, Zhijie Zhang, Junfeng Fang, Zhiyong Cheng, Xiang Wang, Meng Wang
Title: RMBRec: Robust Multi-Behavior Recommendation towards Target Behaviors
Abstract:
Multi‑behavior recommendation faces a critical challenge in practice: auxiliary behaviors (e.g., clicks, carts) are often noisy, weakly correlated, or semantically misaligned with the target behavior (e.g., purchase), which leads to biased preference learning and suboptimal performance. While existing methods attempt to fuse these heterogeneous signals, they inherently lack a principled mechanism to ensure robustness against such behavioral inconsistency. In this work, we propose Robust Multi‑Behavior Recommendation towards Target Behaviors (RMBRec), a robust multi‑behavior recommendation framework grounded in an information‑theoretic robustness principle. We interpret robustness as a joint process of maximizing predictive information while minimizing its variance across heterogeneous behavioral environments. Under this perspective, the Representation Robustness Module (RRM) enhances local semantic consistency by maximizing the mutual information between users' auxiliary and target representations, whereas the Optimization Robustness Module (ORM) enforces global stability by minimizing the variance of predictive risks across behaviors, which is an efficient approximation to invariant risk minimization. This local‑global collaboration bridges representation purification and optimization invariance in a theoretically coherent way. Extensive experiments on three real‑world datasets demonstrate that RMBRec not only outperforms state‑of‑the‑art methods in accuracy but also maintains remarkable stability under various noise perturbations. For reproducibility, our code is available at https://github.com/miaomiao‑cai2/RMBRec/.

Authors:Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong
Title: RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation
Abstract:
The LLM‑as‑a‑Judge paradigm promises scalable rubric‑based evaluation, yet aligning frozen black‑box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence‑anchored Robust Scoring), a compiler‑executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein‑based post‑hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.

Authors:Renyang Liu, Kangjie Chen, Han Qiu, Jie Zhang, Kwok-Yan Lam, Tianwei Zhang, See-Kiong Ng
Title: SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models
Abstract:
Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real‑world deployments and cannot be reliably mitigated by post‑hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine‑grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, which exhibit the limitations of requiring costly retraining, degrading the quality of benign generations, or failing to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference‑time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token‑level interventions in the embedding space. The framework comprises two core components: a latent‑aware multi‑modal safety classifier for identifying unsafe generation trajectories, and a token‑level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug‑and‑play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.

Authors:Jia-Xin He, Hung-Hsuan Chen
Title: GraphFusionSBR: Denoising Multi-Channel Graphs for Session-Based Recommendation
Abstract:
Session‑based recommendation systems must capture implicit user intents from sessions. However, existing models suffer from issues such as item interaction dominance and noisy sessions. We propose a multi‑channel recommendation model, including a knowledge graph channel, a session hypergraph channel, and a session line graph channel, to capture information from multiple sources. Our model adaptively removes redundant edges in the knowledge graph channel to reduce noise. Knowledge graph representations cooperate with hypergraph representations for prediction to alleviate item dominance. We also generate in‑session attention for denoising. Finally, we maximize mutual information between the hypergraph and line graph channels as an auxiliary task. Experiments demonstrate that our method enhances the accuracy of various recommendations, including e‑commerce and multimedia recommendations. We release the code on GitHub for reproducibility.\footnotehttps://github.com/hohehohe0509/DSR‑HK

Authors:Pei Heng, Yi Sun, Jianhua Guo
Title: Structural Dimension Reduction in Bayesian Networks
Abstract:
This work introduces a novel technique, named structural dimension reduction, to collapse a Bayesian network onto a minimum and localized one while ensuring that probabilistic inferences between the original and reduced networks remain consistent. To this end, we propose a new combinatorial structure in directed acyclic graphs called the directed convex hull, which has turned out to be equivalent to their minimum localized Bayesian networks. An efficient polynomial‑time algorithm is devised to identify them by determining the unique directed convex hulls containing the variables of interest from the original networks. Experiments demonstrate that the proposed technique has high dimension reduction capability in real networks, and the efficiency of probabilistic inference based on directed convex hulls can be significantly improved compared with traditional methods such as variable elimination and belief propagation algorithms. The code of this study is open at \hrefhttps://github.com/Balance‑H/Algorithmshttps://github.com/Balance‑H/Algorithms and the proofs of the results in the main body are postponed to the appendix.

Authors:Taminul Islam, Toqi Tahamid Sarker, Mohamed Embaby, Khaled R Ahmed, Amer AbuGhazaleh
Title: FUME: Fused Unified Multi-Gas Emission Network for Livestock Rumen Acidosis Detection
Abstract:
Ruminal acidosis is a prevalent metabolic disorder in dairy cattle causing significant economic losses and animal welfare concerns. Current diagnostic methods rely on invasive pH measurement, limiting scalability for continuous monitoring. We present FUME (Fused Unified Multi‑gas Emission Network), the first deep learning approach for rumen acidosis detection from dual‑gas optical imaging under in vitro conditions. Our method leverages complementary carbon dioxide (CO2) and methane (CH4) emission patterns captured by infrared cameras to classify rumen health into Healthy, Transitional, and Acidotic states. FUME employs a lightweight dual‑stream architecture with weight‑shared encoders, modality‑specific self‑attention, and channel attention fusion, jointly optimizing gas plume segmentation and classification of dairy cattle health. We introduce the first dual‑gas OGI dataset comprising 8,967 annotated frames across six pH levels with pixel‑level segmentation masks. Experiments demonstrate that FUME achieves 80.99% mIoU and 98.82% classification accuracy while using only 1.28M parameters and 1.97G MACs‑‑outperforming state‑of‑the‑art methods in segmentation quality with 10x lower computational cost. Ablation studies reveal that CO2 provides the primary discriminative signal and dual‑task learning is essential for optimal performance. Our work establishes the feasibility of gas emission‑based livestock health monitoring, paving the way for practical, in vitro acidosis detection systems. Codes are available at https://github.com/taminulislam/fume.

Authors:Anh H. Vo, Tae-Seok Kim, Hulin Jin, Soo-Mi Choi, Yong-Guk Kim
Title: Instruction-Driven 3D Facial Expression Generation and Transition
Abstract:
A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction‑driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction‑driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state‑of‑the‑art methods on the CK+ and CelebV‑HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications More information about our project can be found at https://vohoanganh.github.io/tg3dfet/

Authors:Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli
Title: MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness
Abstract:
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine‑tuning data. However, naive "act‑as‑a‑user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so‑called user proxy agents. We present MIRRORBENCH, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human‑like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MIRRORBENCH features a modular execution engine with typed interfaces, metadata‑driven registries, multi‑backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance‑aware harness. We include three lexical‑diversity metrics (MATTR, YULE'S K, and HD‑D) and three LLM‑judge‑based metrics (GTEval, Pairwise Indistinguishability, and Rubric‑and‑Reason). Across four open datasets, MIRRORBENCH yields variance‑aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command‑line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at https://github.com/SAP/mirrorbench.

Authors:Zheng Zhou, Isabella McEvoy, Camilo E. Valderrama
Title: Local-Global Feature Fusion for Subject-Independent EEG Emotion Recognition
Abstract:
Subject‑independent EEG emotion recognition is challenged by pronounced inter‑subject variability and the difficulty of learning robust representations from short, noisy recordings. To address this, we propose a fusion framework that integrates (i) local, channel‑wise descriptors and (ii) global, trial‑level descriptors, improving cross‑subject generalization on the SEED‑VII dataset. Local representations are formed per channel by concatenating differential entropy with graph‑theoretic features, while global representations summarize time‑domain, spectral, and complexity characteristics at the trial level. These representations are fused in a dual‑branch transformer with attention‑based fusion and domain‑adversarial regularization, with samples filtered by an intensity threshold. Experiments under a leave‑one‑subject‑out protocol demonstrate that the proposed method consistently outperforms single‑view and classical baselines, achieving approximately 40% mean accuracy in 7‑class subject‑independent emotion recognition. The code has been released at https://github.com/Danielz‑z/LGF‑EEG‑Emotion.

Authors:Xin Dai, Pengcheng Huang, Zhenghao Liu, Shuo Wang, Yukun Yan, Chaojun Xiao, Yu Gu, Ge Yu, Maosong Sun
Title: Revealing the Attention Floating Mechanism in Masked Diffusion Models
Abstract:
Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under‑explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure‑Aware, Deep Content‑Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in‑context learning capabilities of MDMs, allowing them to double the performance compared to ARMs in knowledge‑intensive tasks. All codes and datasets are available at https://github.com/NEUIR/Attention‑Floating.

Authors:Hong Huang, Decheng Wu, Qiangqiang Hu, Guanghua Yu, Jinhai Yang, Jianchen Zhu, Xue Liu, Dapeng Wu
Title: Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification
Abstract:
The deployment of Large Language Models (LLMs) on resource‑constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to ‑1, 0, +1, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2‑bit aligned packing, which incurs significant bit wastage, or 1.67‑bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware‑efficient ternary quantization framework. Sherry introduces a 3:4 fine‑grained sparsity that achieves a regularized 1.25‑bit width by packing blocks of four weights into five bits, restoring power‑of‑two alignment. Furthermore, we identify weight trapping issue in sparse ternary training, which leads to representational collapse. To address this, Sherry introduces Arenas, an annealing residual synapse mechanism that maintains representational diversity during training. Empirical evaluations on LLaMA‑3.2 across five benchmarks demonstrate that Sherry matches state‑of‑the‑art ternary performance while significantly reducing model size. Notably, on an Intel i7‑14700HX CPU, our 1B model achieves zero accuracy loss compared to SOTA baselines while providing 25% bit savings and 10% speed up. The code is available at https://github.com/Tencent/AngelSlim .

Authors:Simon Jegou, Maximilian Jeblick
Title: KVzap: Fast, Adaptive, and Faithful KV Cache Pruning
Abstract:
Growing context lengths in transformer‑based language models have made the key‑value (KV) cache a critical inference bottleneck. While many KV cache pruning methods have been proposed, they have not yet been adopted in major inference engines due to speed‑‑accuracy trade‑offs. We introduce KVzap, a fast, input‑adaptive approximation of KVzip that works in both prefilling and decoding. On Qwen3‑8B, Llama‑3.1‑8B‑Instruct, and Qwen3‑32B across long‑context and reasoning tasks, KVzap achieves 2‑‑4× KV cache compression with negligible accuracy loss and achieves state‑of‑the‑art performance on the KVpress leaderboard. Code and models are available at https://github.com/NVIDIA/kvpress.

Authors:Hao-Xiang Xu, Jun-Yu Ma, Ziqi Peng, Yuhao Sun, Zhen-Hua Ling, Jia-Chen Gu
Title: Multiplicative Orthogonal Sequential Editing for Language Models
Abstract:
Knowledge editing aims to efficiently modify the internal knowledge of large language models (LLMs) without compromising their other capabilities. The prevailing editing paradigm, which appends an update matrix to the original parameter matrix, has been shown by some studies to damage key numerical stability indicators (such as condition number and norm), thereby reducing editing performance and general abilities, especially in sequential editing scenario. Although subsequent methods have made some improvements, they remain within the additive framework and have not fundamentally addressed this limitation. To solve this problem, we analyze it from both statistical and mathematical perspectives and conclude that multiplying the original matrix by an orthogonal matrix does not change the numerical stability of the matrix. Inspired by this, different from the previous additive editing paradigm, a multiplicative editing paradigm termed Multiplicative Orthogonal Sequential Editing (MOSE) is proposed. Specifically, we first derive the matrix update in the multiplicative form, the new knowledge is then incorporated into an orthogonal matrix, which is multiplied by the original parameter matrix. In this way, the numerical stability of the edited matrix is unchanged, thereby maintaining editing performance and general abilities. We compared MOSE with several current knowledge editing methods, systematically evaluating their impact on both editing performance and the general abilities across three different LLMs. Experimental results show that MOSE effectively limits deviations in the edited parameter matrix and maintains its numerical stability. Compared to current methods, MOSE achieves a 12.08% improvement in sequential editing performance, while retaining 95.73% of general abilities across downstream tasks. The code is available at https://github.com/famoustourist/MOSE.

Authors:Nina Peire, Yupei Li, Björn Schuller
Title: Affect and Effect: Limitations of regularisation-based continual learning in EEG-based emotion classification
Abstract:
Generalisation to unseen subjects in EEG‑based emotion classification remains a challenge due to high inter‑and intra‑subject variability. Continual learning (CL) poses a promising solution by learning from a sequence of tasks while mitigating catastrophic forgetting. Regularisation‑based CL approaches, such as Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI), and Memory Aware Synapses (MAS), are commonly used as baselines in EEG‑based CL studies, yet their suitability for this problem remains underexplored. This study theoretically and empirically finds that regularisation‑based CL methods show limited performance for EEG‑based emotion classification on the DREAMER and SEED datasets. We identify a fundamental misalignment in the stability‑plasticity trade‑off, where regularisation‑based methods prioritise mitigating catastrophic forgetting (backward transfer) over adapting to new subjects (forward transfer). We investigate this limitation under subject‑incremental sequences and observe that: (1) the heuristics for estimating parameter importance become less reliable under noisy data and covariate shift, (2) gradients on parameters deemed important by these heuristics often interfere with gradient updates required for new subjects, moving optimisation away from the minimum, (3) importance values accumulated across tasks over‑constrain the model, and (4) performance is sensitive to subject order. Forward transfer showed no statistically significant improvement over sequential fine‑tuning (p > 0.05 across approaches and datasets). The high variability of EEG signals means past subjects provide limited value to future subjects. Regularisation‑based continual learning approaches are therefore limited for robust generalisation to unseen subjects in EEG‑based emotion classification.

Authors:Wen Guo
Title: DT-ICU: Towards Explainable Digital Twins for ICU Patient Monitoring via Multi-Modal and Multi-Task Iterative Inference
Abstract:
We introduce DT‑ICU, a multimodal digital twin framework for continuous risk estimation in intensive care. DT‑ICU integrates variable‑length clinical time series with static patient information in a unified multitask architecture, enabling predictions to be updated as new observations accumulate over the ICU stay. We evaluate DT‑ICU on the large, publicly available MIMIC‑IV dataset, where it consistently outperforms established baseline models under different evaluation settings. Our test‑length analysis shows that meaningful discrimination is achieved shortly after admission, while longer observation windows further improve the ranking of high‑risk patients in highly imbalanced cohorts. To examine how the model leverages heterogeneous data sources, we perform systematic modality ablations, revealing that the model learnt a reasonable structured reliance on interventions, physiological response observations, and contextual information. These analyses provide interpretable insights into how multimodal signals are combined and how trade‑offs between sensitivity and precision emerge. Together, these results demonstrate that DT‑ICU delivers accurate, temporally robust, and interpretable predictions, supporting its potential as a practical digital twin framework for continuous patient monitoring in critical care. The source code and trained model weights for DT‑ICU are publicly available at https://github.com/GUO‑W/DT‑ICU‑release.

Authors:Masahiro Kato
Title: Riesz Representer Fitting under Bregman Divergence: A Unified Framework for Debiased Machine Learning
Abstract:
Estimating the Riesz representer is central to debiased machine learning for causal and structural parameter estimation. We propose generalized Riesz regression, a unified framework that estimates the Riesz representer by fitting a representer model via Bregman divergence minimization. This framework includes the squared loss and the Kullback‑‑Leibler (KL) divergence as special cases: the former recovers Riesz regression, while the latter recovers tailored loss minimization. Under suitable model specifications, the dual problems correspond to covariate balancing, which we call automatic covariate balancing. Moreover, under the same specifications, outcome averages weighted by the estimated Riesz representer satisfy Neyman orthogonality even without estimating the regression function, a property we call automatic Neyman orthogonalization. This property not only reduces the estimation error of Neyman orthogonal scores but also clarifies a key distinction between debiased machine learning and targeted maximum likelihood estimation. Our framework can also be viewed as a generalization of density ratio fitting under Bregman divergences to Riesz representer estimation, and it applies beyond density ratio estimation. We provide convergence analyses for both reproducing kernel Hilbert space (RKHS) and neural network model classes. A Python package for generalized Riesz regression is available at https://github.com/MasaKat0/grr.

Authors:Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
Title: Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference
Abstract:
Due to the prevalence of large language models (LLMs), key‑value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer‑wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre‑defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training‑free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user‑specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that equipped with one‑shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state‑of‑the‑art layer‑wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.

Authors:Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, Hao Zhang
Title: d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation
Abstract:
Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random‑order generation. However, realizing these benefits in practice is non‑trivial, as dLLMs inherently face an accuracy‑parallelism trade‑off. Despite increasing interest, existing methods typically focus on only one‑side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo‑Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo‑trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy‑based multi‑block decoding with a KV‑cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10× speedup over vanilla LLaDA/Dream and 5× speedup over AR models without much accuracy drop. Our code is available at https://github.com/hao‑ai‑lab/d3LLM.

Authors:Zexi Tan, Tao Xie, Haoyi Xiao, Baoyao Yang, Yuzhu Ji, An Zeng, Xiang Zhang, Yiqun Zhang
Title: TFEC: Multivariate Time-Series Clustering via Temporal-Frequency Enhanced Contrastive Learning
Abstract:
Multivariate Time‑Series (MTS) clustering is crucial for signal processing and data analysis. Although deep learning approaches, particularly those leveraging Contrastive Learning (CL), are prominent for MTS representation, existing CL‑based models face two key limitations: 1) neglecting clustering information during positive/negative sample pair construction, and 2) introducing unreasonable inductive biases, e.g., destroying time dependence and periodicity through augmentation strategies, compromising representation quality. This paper, therefore, proposes a Temporal‑Frequency Enhanced Contrastive (TFEC) learning framework. To preserve temporal structure while generating low‑distortion representations, a temporal‑frequency Co‑EnHancement (CoEH) mechanism is introduced. Accordingly, a synergistic dual‑path representation and cluster distribution learning framework is designed to jointly optimize cluster structure and representation fidelity. Experiments on six real‑world benchmark datasets demonstrate TFEC's superiority, achieving 4.48% average NMI gains over SOTA methods, with ablation studies validating the design. The code of the paper is available at: https://github.com/yueliangy/TFEC.

Authors:Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu, Peng Zhang, Xindian Ma
Title: ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs
Abstract:
The emergence of fine‑grained numerical formats like NVFP4 presents new opportunities for efficient Large Language Model (LLM) inference. However, it is difficult to adapt existing Post‑Training Quantization (PTQ) strategies to these formats: rotation‑based methods compromise fine‑grained block isolation; smoothing techniques struggle with significant 4‑bit quantization errors; and mixed‑precision approaches often conflict with hardware constraints on unified‑precision computation. To address these challenges, we propose ARCQuant, a framework that boosts NVFP4 performance via Augmented Residual Channels. Distinct from methods that compromise block isolation or hardware uniformity, ARCQuant maintains a strictly unified NVFP4 format by augmenting the activation matrix with quantized residual channels. This design integrates the error compensation process directly into the matrix reduction dimension, enabling the use of standard, highly optimized GEMM kernels with minimal overhead. Theoretical analysis confirms that the worst‑case error bound of our dual‑stage NVFP4 quantization is comparable to that of standard 8‑bit formats such as MXFP8. Extensive experiments on LLaMA and Qwen models demonstrate that ARCQuant achieves state‑of‑the‑art accuracy, comparable to full‑precision baselines in perplexity and downstream tasks. Furthermore, deployment on RTX 5090 and RTX PRO 6000 GPUs confirms practical benefits, achieving up to 3x speedup over FP16. Our code is available at https://github.com/actypedef/ARCQuant .

Authors:Michael J. Clark
Title: AntiPaSTO: Self-Supervised Steering of Moral Reasoning
Abstract:
As models grow more capable, human supervision breaks down: labels don't scale, outputs can be gamed, and training doesn't generalize. Scalable oversight requires steering methods that are internal, self‑supervised, and transfer out‑of‑distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an anti‑parallel axis (α=\pm1 produce opposite shifts), with coherence constraints preventing collapse. Human input is minimal: two contrasting words inserted into template sentences, no preference labels. Using 800 such pairs on Gemma‑3‑1B, AntiPaSTO beats prompting baselines by 6.9 times on DailyDilemmas and maintains bidirectional control where prompting triggers refusal.

Authors:Zhuoka Feng, Kang Chen, Sihan Zhao, Kai Xiong, Yaoning Wang, Minshen Yu, Junjie Nian, Changyi Xiao, Yixin Cao, Yugang Jiang
Title: ARM: Role-Conditioned Neuron Transplantation for Training-Free Generalist LLM Agent Merging
Abstract:
Interactive large language model agents have advanced rapidly, but most remain specialized to a single environment and fail to adapt robustly to other environments. Model merging offers a training‑free alternative by integrating multiple experts into a single model. In this paper, we propose Agent‑Role Merging (ARM), an activation‑guided, role‑conditioned neuron transplantation method for model merging in LLM agents. ARM improves existing merging methods from static natural language tasks to multi‑turn agent scenarios, and over the generalization ability across various interactive environments. This is achieved with a well designed 3‑step framework: 1) constructing merged backbones, 2) selection based on its role‑conditioned activation analysis, and 3) neuron transplantation for fine‑grained refinements. Without gradient‑based optimization, ARM improves cross‑benchmark generalization while enjoying efficiency. Across diverse domains, the model obtained via ARM merging outperforms prior model merging methods and domain‑specific expert models, while demonstrating strong out‑of‑domain generalization.

Authors:Ruiyi Ding, Yongxuan Lv, Xianhui Meng, Jiahe Song, Chao Wang, Chen Jiang, Yuan Cheng
Title: PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
Abstract:
Policy optimization for large language models often suffers from sparse reward signals in multi‑step reasoning tasks. Critic‑free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning . While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low‑reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process‑level guidance in a critic‑free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token‑level advantages, and aligns their distribution with outcome advantages through location‑parameter shift. On MATH500, PRPO improves Qwen2.5‑Math‑1.5B accuracy from 61.2% to 64.4% over GRPO using only eight rollouts and no value network, demonstrating efficient fine‑grained credit assignment within critic‑free optimization. Code is available at: https://github.com/SchumiDing/srpocode

Authors:Osama Yousuf, Andreu L. Glasmann, Martin Lueker-Boden, Sina Najmaei, Gina C. Adam
Title: XBTorch: A Unified Framework for Modeling and Co-Design of Crossbar-Based Deep Learning Accelerators
Abstract:
Emerging memory technologies have gained significant attention as a promising pathway to overcome the limitations of conventional computing architectures in deep learning applications. By enabling computation directly within memory, these technologies ‑ built on nanoscale devices with tunable and nonvolatile conductance ‑ offer the potential to drastically reduce energy consumption and latency compared to traditional von Neumann systems. This paper introduces XBTorch (short for CrossBarTorch), a novel simulation framework that integrates seamlessly with PyTorch and provides specialized tools for accurately and efficiently modeling crossbar‑based systems based on emerging memory technologies. Through detailed comparisons and case studies involving hardware‑aware training and inference, we demonstrate how XBTorch offers a unified interface for key research areas such as device‑level modeling, cross‑layer co‑design, and inference‑time fault tolerance. While exemplar studies utilize ferroelectric field‑effect transistor (FeFET) models, the framework remains technology‑agnostic ‑ supporting other emerging memories such as resistive RAM (ReRAM), as well as enabling user‑defined custom device models. The code is publicly available at: https://github.com/ADAM‑Lab‑GW/xbtorch

Authors:Vladimer Khasia
Title: HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression
Abstract:
Post‑training quantization is essential for deploying Large Language Models (LLMs) on resource‑constrained devices. However, standard integer quantization (e.g., INT4) fundamentally degrades performance by imposing a uniform grid on the heavy‑tailed distribution of weight parameters, particularly in smaller‑scale models (e.g., <2B parameters). We introduce HAS‑VQ (Hessian‑Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high‑sensitivity outliers from the bulk weight distribution using second‑order sensitivity analysis. HAS‑VQ employs a Hessian‑Masked Decoupling strategy to isolate sensitive parameters, followed by robust Vector Quantization (VQ) of the remaining dense body. Crucially, we introduce a residual sparse feedback mechanism that corrects quantization errors in the most sensitive dimensions, ensuring exact reconstruction of outliers. We evaluate HAS‑VQ on SmolLM2‑1.7B, demonstrating two distinct regimes of superiority: (1) Pareto Dominance over Integer Baselines: At 4.23 effective bits‑per‑parameter (BPP), we achieve a perplexity of 14.23, significantly outperforming the standard INT4 baseline (20.03 PPL at 4.71 BPP). (2) High‑Fidelity Compression: Relative to the FP16 baseline, HAS‑VQ achieves a 2.3x reduction in model size (7.03 BPP) while maintaining statistically indistinguishable perplexity (10.12 vs. 10.04), effectively offering a lossless compression alternative for bandwidth‑constrained environments. The code is available at https://github.com/VladimerKhasia/HASVQ

Authors:Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang
Title: X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
Abstract:
Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real‑world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real‑world data. To support this, we leverage feature‑based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine‑tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X‑Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder‑14B‑Preview and AReal‑boba2‑14B despite having only 7B parameters. In‑depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code‑centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high‑quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real‑world coding data.

Authors:Shiyuan Zhang, Yilai Liu, Yuwei Du, Ruoxuan Yang, Dong In Kim, Hongyang Du
Title: U-MASK: User-adaptive Spatio-Temporal Masking for Personalized Mobile AI Applications
Abstract:
Personalized mobile artificial intelligence applications are widely deployed, yet they are expected to infer user behavior from sparse and irregular histories under a continuously evolving spatio‑temporal context. This setting induces a fundamental tension among three requirements, i.e., immediacy to adapt to recent behavior, stability to resist transient noise, and generalization to support long‑horizon prediction and cold‑start users. Most existing approaches satisfy at most two of these requirements, resulting in an inherent impossibility triangle in data‑scarce, non‑stationary personalization. To address this challenge, we model mobile behavior as a partially observed spatio‑temporal tensor and unify short‑term adaptation, long‑horizon forecasting, and cold‑start recommendation as a conditional completion problem, where a user‑ and task‑specific mask specifies which coordinates are treated as evidence. We propose U‑MASK, a user‑adaptive spatio‑temporal masking method that allocates evidence budgets based on user reliability and task sensitivity. To enable mask generation under sparse observations, U‑MASK learns a compact, task‑agnostic user representation from app and location histories via U‑SCOPE, which serves as the sole semantic conditioning signal. A shared diffusion transformer then performs mask‑guided generative completion while preserving observed evidence, so personalization and task differentiation are governed entirely by the mask and the user representation. Experiments on real‑world mobile datasets demonstrate consistent improvements over state‑of‑the‑art methods across short‑term prediction, long‑horizon forecasting, and cold‑start settings, with the largest gains under severe data sparsity. The code and dataset will be available at https://github.com/NICE‑HKU/U‑MASK.

Authors:Zhongping Ji
Title: CliffordNet: All You Need is Geometric Algebra
Abstract:
Modern computer vision architectures, from CNNs to Transformers, predominantly rely on the stacking of heuristic modules: spatial mixers (Attention/Conv) followed by channel mixers (FFNs). In this work, we challenge this paradigm by returning to mathematical first principles. We propose the Clifford Algebra Network (CAN), also referred to as CliffordNet, a vision backbone grounded purely in Geometric Algebra. Instead of engineering separate modules for mixing and memory, we derive a unified interaction mechanism based on the Clifford Geometric Product (uv = u \cdot v + u \wedge v). This operation ensures algebraic completeness regarding the Geometric Product by simultaneously capturing feature coherence (via the generalized inner product) and structural variation (via the exterior wedge product). Implemented via an efficient sparse rolling mechanism with strict linear complexity \mathcalO(N), our model reveals a surprising emergent property: the geometric interaction is so representationally dense that standard Feed‑Forward Networks (FFNs) become redundant. Empirically, CliffordNet establishes a new Pareto frontier: our Nano variant achieves 76.41% accuracy on CIFAR‑100 with only 1.4M parameters, effectively matching the heavy‑weight ResNet‑18 (11.2M) with 8× fewer parameters, while our Base variant sets a new SOTA for tiny models at 78.05%. Our results suggest that global understanding can emerge solely from rigorous, algebraically complete local interactions, potentially signaling a shift where geometry is all you need. Code is available at https://github.com/ParaMind2025/CAN.

Authors:Malavika Pradeep, Akshay Sasi, Nusaibah Farrukh, Rahul Venugopal, Elizabeth Sherly
Title: Cross-Modal Computational Model of Brain-Heart Interactions via HRV and EEG Feature
Abstract:
The electroencephalogram (EEG) has been the gold standard for quantifying mental workload; however, due to its complexity and non‑portability, it can be constraining. ECG signals, which are feasible on wearable equipment pieces such as headbands, present a promising method for cognitive state monitoring. This research explores whether electrocardiogram (ECG) signals are able to indicate mental workload consistently and act as surrogates for EEG‑based cognitive indicators. This study investigates whether ECG‑derived features can serve as surrogate indicators of cognitive load, a concept traditionally quantified using EEG. Using a publicly available multimodal dataset (OpenNeuro) of EEG and ECG recorded during working‑memory and listening tasks, features of HRV and Catch22 descriptors are extracted from ECG, and spectral band‑power with Catch22 features from EEG. A cross‑modal regression framework based on XGBoost was trained to map ECG‑derived HRV representations to EEG‑derived cognitive features. In order to address data sparsity and model brain‑heart interactions, we integrated the PSV‑SDG to produce EEG‑conditioned synthetic HRV time series.This addresses the challenge of inferring cognitive load solely from ECG‑derived features using a combination of multimodal learning, signal processing, and synthetic data generation. These outcomes form a basis for light, interpretable machine learning models that are implemented through wearable biosensors in non‑lab environments. Synthetic HRV inclusion enhances robustness, particularly in sparse data situations. Overall, this work is an initiation for building low‑cost, explainable, and real‑time cognitive monitoring systems for mental health, education, and human‑computer interaction, with a focus on ageing and clinical populations.

Authors:Juan Miguel López Alcaraz, Xicoténcatl López Moran, Erick Dávila Zaragoza, Claas Händel, Richard Koebe, Wilhelm Haverkamp, Nils Strodthoff
Title: A Multimodal Deep Learning Framework for Predicting ICU Deterioration: Integrating ECG Waveforms with Clinical Data and Clinician Benchmarking
Abstract:
Artificial intelligence holds strong potential to support clinical decision making in intensive care units where timely and accurate risk assessment is critical. However, many existing models focus on isolated outcomes or limited data types, while clinicians integrate longitudinal history, real time physiology, and heterogeneous clinical information. To address this gap, we developed MDS ICU, a unified multimodal machine learning framework that fuses routinely collected data including demographics, biometrics, vital signs, laboratory values, ECG waveforms, surgical procedures, and medical device usage to provide continuous predictive support during ICU stays. Using 63001 samples from 27062 patients in MIMIC IV, we trained a deep learning architecture that combines structured state space S4 encoders for ECG waveforms with multilayer perceptron RealMLP encoders for tabular data to jointly predict 33 clinically relevant outcomes spanning mortality, organ dysfunction, medication needs, and acute deterioration. The model achieved strong discrimination with AUROCs of 0.90 for 24 hour mortality, 0.92 for sedative administration, 0.97 for invasive mechanical ventilation, and 0.93 for coagulation dysfunction. Calibration analysis showed close agreement between predicted and observed risks, with consistent gains from ECG waveform integration. Comparisons with clinicians and large language models showed that model predictions alone outperformed both, and that providing model outputs as decision support further improved their performance. These results demonstrate that multimodal AI can deliver clinically meaningful risk stratification across diverse ICU outcomes while augmenting rather than replacing clinical expertise, establishing a scalable foundation for precision critical care decision support.

Authors:Carl Vincent Ladres Kho
Title: Pareto-Optimal Model Selection for Low-Cost, Single-Lead EMG Control in Embedded Systems
Abstract:
Consumer‑grade biosensors offer a cost‑effective alternative to medical‑grade electromyography (EMG) systems, reducing hardware costs from thousands of dollars to approximately 13. However, these low‑cost sensors introduce significant signal instability and motion artifacts. Deploying machine learning models on resource‑constrained edge devices like the ESP32 presents a challenge: balancing classification accuracy with strict latency (<100ms) and memory (<320KB) constraints. Using a single‑subject dataset comprising 1,540 seconds of raw data (1.54M data points, segmented into ~1,300 one‑second windows), I evaluate 18 model architectures, ranging from statistical heuristics to deep transfer learning (ResNet50) and custom hybrid networks (MaxCRNN). While my custom "MaxCRNN" (Inception + Bi‑LSTM + Attention) achieved the highest safety (99% Precision) and robustness, I identify Random Forest (74% accuracy) as the Pareto‑optimal solution for embedded control on legacy microcontrollers. I demonstrate that reliable, low‑latency EMG control is feasible on commodity hardware, with Deep Learning offering a path to near‑perfect reliability on modern Edge AI accelerators.

Authors:Sang T. Truong, Duc Q. Nguyen, Willie Neiswanger, Ryan-Rhys Griffiths, Stefano Ermon, Nick Haber, Sanmi Koyejo
Title: Neural Nonmyopic Bayesian Optimization in Dynamic Cost Settings
Abstract:
Bayesian optimization (BO) is a common framework for optimizing black‑box functions, yet most existing methods assume static query costs and rely on myopic acquisition strategies. We introduce LookaHES, a nonmyopic BO framework designed for dynamic, history‑dependent cost environments, where evaluation costs vary with prior actions, such as travel distance in spatial tasks or edit distance in sequence design. LookaHES combines a multi‑step variant of H‑Entropy Search with pathwise sampling and neural policy optimization, enabling long‑horizon planning beyond twenty steps without the exponential complexity of existing nonmyopic methods. The key innovation is the integration of neural policies, including large language models, to effectively navigate structured, combinatorial action spaces such as protein sequences. These policies amortize lookahead planning and can be integrated with domain‑specific constraints during rollout. Empirically, LookaHES outperforms strong myopic and nonmyopic baselines across nine synthetic benchmarks from two to eight dimensions and two real‑world tasks: geospatial optimization using NASA night‑light imagery and protein sequence design with constrained token‑level edits. In short, LookaHES provides a general, scalable, and cost‑aware solution for robust long‑horizon optimization in complex decision spaces, which makes it a useful tool for researchers in machine learning, statistics, and applied domains. Our implementation is available at https://github.com/sangttruong/nonmyopia.

Authors:Xuezhe Ma, Shicheng Wen, Linghao Jin, Bilge Acun, Ruihang Lai, Bohan Hou, Will Lin, Hao Zhang, Songlin Yang, Ryan Lee, Mengxi Wu, Jonathan May, Luke Zettlemoyer, Carole-Jean Wu
Title: Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths
Abstract:
Designing a unified neural network to efficiently and inherently process sequential data with arbitrary lengths is a central and challenging problem in sequence modeling. The design choices in Transformer, including quadratic complexity and weak length extrapolation, have limited their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability to capture long range dependencies, including timestep decay normalization, sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon in the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long‑context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2‑7B (1.75) and Megalodon‑7B (1.70), and landing close to Llama2‑13B (1.67). Notably, without relying on any context‑extension techniques, Gecko exhibits inherent long‑context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4× longer than its attention window. Code: https://github.com/XuezheMax/gecko‑llm

Authors:Anshul Kumar
Title: Is Sanskrit the most token-efficient language? A quantitative study using GPT, Gemini, and SentencePiece
Abstract:
Tokens are the basic units of Large Language Models (LLMs). LLMs rely on tokenizers to segment text into these tokens, and tokenization is the primary determinant of computational and inference cost. Sanskrit, one of the oldest languages, is hypothesized to express more meaning per token due to its morphology and grammar rules; however, no prior work has quantified this. We use a dataset of 701 parallel verses of the Bhagavad Gita, which comprises three languages‑Sanskrit, English, and Hindi along with transliteration of Sanskrit into English. We test tokenizers including SentencePiece (SPM), older GPT models, and the latest generation tokenizers from Gemini and GPT. We use metrics of token count, characters per token (token efficiency), and tokens per character (token cost). Results show a ~2x difference in token counts between Sanskrit and English/Hindi under the unbiased SPM baseline. English/Hindi translations of Sanskrit commentary resulted in an approximately 20x increase in token count. GPT o200k base (latest, used by GPT‑4o) and Gemini (latest) reduce bias by a significant degree compared to GPT cl100k base (used until GPT‑4), but still fail to fully capture Sanskrit's compactness. This matters because there might be a penalty bias for non‑English users, which inflates the token count. This research provides a foundation for improving future tokenizer design and shows the potential of Sanskrit for highly compact encoding, saving on cost while speeding up training and inference. The code and dataset are available at https://github.com/anshulkr713/sanskrit‑token‑efficiency

Authors:Ahmed H. Ismail, Anthony Kuang, Ayo Akinkugbe, Kevin Zhu, Sean O'Brien
Title: CBMAS: Cognitive Behavioral Modeling via Activation Steering
Abstract:
Large language models (LLMs) often encode cognitive behaviors unpredictably across prompts, layers, and contexts, making them difficult to diagnose and control. We present CBMAS, a diagnostic framework for continuous activation steering, which extends cognitive bias analysis from discrete before/after interventions to interpretable trajectories. By combining steering vector construction with dense α‑sweeps, logit lens‑based bias curves, and layer‑site sensitivity analysis, our approach can reveal tipping points where small intervention strengths flip model behavior and show how steering effects evolve across layer depth. We argue that these continuous diagnostics offer a bridge between high‑level behavioral evaluation and low‑level representational dynamics, contributing to the cognitive interpretability of LLMs. Lastly, we provide a CLI and datasets for various cognitive behaviors at the project repository, https://github.com/shimamooo/CBMAS.

Authors:Bohan Liang, Zijian Chen, Qi Jia, Kaiwei Zhang, Kaiyuan Ji, Guangtao Zhai
Title: PriceSeer: Evaluating Large Language Models in Real-Time Stock Prediction
Abstract:
Stock prediction, a subject closely related to people's investment activities in fully dynamic and live environments, has been widely studied. Current large language models (LLMs) have shown remarkable potential in various domains, exhibiting expert‑level performance through advanced reasoning and contextual understanding. In this paper, we introduce PriceSeer, a live, dynamic, and data‑uncontaminated benchmark specifically designed for LLMs performing stock prediction tasks. Specifically, PriceSeer includes 110 U.S. stocks from 11 industrial sectors, with each containing 249 historical data points. Our benchmark implements both internal and external information expansion, where LLMs receive extra financial indicators, news, and fake news to perform stock price prediction. We evaluate six cutting‑edge LLMs under different prediction horizons, demonstrating their potential in generating investment strategies after obtaining accurate price predictions for different sectors. Additionally, we provide analyses of LLMs' suboptimal performance in long‑term predictions, including the vulnerability to fake news and specific industries. The code and evaluation data will be open‑sourced at https://github.com/BobLiang2113/PriceSeer.

Authors:Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, Ningyu Zhang
Title: Can We Predict Before Executing Machine Learning Agents?
Abstract:
Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate‑Execute‑Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data‑centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict‑then‑Verify loop, achieving a 6x acceleration in convergence while surpassing execution‑based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict‑before‑execute.

Authors:Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
Title: Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Abstract:
As Large Language Models (LLMs) are increasingly deployed in real‑world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point‑wise confidence like Self‑Consistency, which can mask brittle belief. We show that even facts answered with perfect self‑consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor‑Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress‑testing protocol that probes outputs stability under contextual interference. Experiments across multiple LLMs show that the performance of high‑NCB data is relatively more resistant to interference. Finally, we present Structure‑Aware Training (SAT), which optimizes context‑invariant belief structure and reduces long‑tail knowledge brittleness by approximately 30%. Code will be available at https://github.com/zjunlp/belief.

Authors:Alexandra Dragomir, Florin Brad, Radu Tudor Ionescu
Title: CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning
Abstract:
Large language models (LLMs) have demonstrated competitive performance in zero‑shot multilingual machine translation (MT). Some follow‑up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state‑of‑the‑art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy‑to‑hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra‑dragomir/CLewR.

Authors:ChunTeng Chen, YiChen Hsu, YiWen Liu, WeiFang Sun, TsaiChing Ni, ChunYi Lee, Min Sun, YuanFu Yang
Title: SceneFoundry: Generating Interactive Infinite 3D Worlds
Abstract:
The ability to automatically generate large‑scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real‑world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language‑guided diffusion framework that generates apartment‑scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion‑based posterior sampling efficiently populates the scene with articulated assets from large‑scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research. project page: https://anc891203.github.io/SceneFoundry‑Demo/

Authors:Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Title: EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Abstract:
Large language models (LLMs) are expected to be trained to act as agents in various real‑world environments, but this process relies on rich and varied tool‑interaction sandboxes. However, access to real systems is often restricted; LLM‑simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool‑interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule‑based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi‑turn, multi‑tool interactions. We release our code and data at https://github.com/RUC‑NLPIR/EnvScaler.

Authors:Yongyi Yang, Jianyang Gao
Title: mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations
Abstract:
Hyper‑Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold‑Constrained Hyper‑Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn‑‑Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff‑‑von Neumann theorem, we propose mHC‑lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC‑lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc‑lite.

Authors:George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi
Title: Falsifying Sparse Autoencoder Reasoning Features in Language Models
Abstract:
We study how reliably sparse autoencoders (SAEs) support claims about reasoning‑related internal features in large language models. We first give a stylized analysis showing that sparsity‑regularized decoding can preferentially retain stable low‑dimensional correlates while suppressing high‑dimensional within‑behavior variation, motivating the possibility that contrastively selected "reasoning" features may concentrate on cue‑like structure when such cues are coupled with reasoning traces. Building on this perspective, we propose a falsification‑based evaluation framework that combines causal token injection with LLM‑guided counterexample construction. Across 22 configurations spanning multiple model families, layers, and reasoning datasets, we find that many contrastively selected candidates are highly sensitive to token‑level interventions, with 45%‑90% activating after injecting only a few associated tokens into non‑reasoning text. For the remaining context‑dependent candidates, LLM‑guided falsification produces targeted non‑reasoning inputs that trigger activation and meaning‑preserving paraphrases of top‑activating reasoning traces that suppress it. A small steering study yields minimal changes on the evaluated benchmarks. Overall, our results suggest that, in the settings we study, sparse decompositions can favor low‑dimensional correlates that co‑occur with reasoning, underscoring the need for falsification when attributing high‑level behaviors to individual SAE features. Code is available at https://github.com/GeorgeMLP/reasoning‑probing.

Authors:Shufei Ge, Shijia Wang, Lloyd Elliott
Title: Poisson Hyperplane Processes with Rectified Linear Units
Abstract:
Neural networks have shown state‑of‑the‑art performances in various classification and regression tasks. Rectified linear units (ReLU) are often used as activation functions for the hidden layers in a neural network model. In this article, we establish the connection between the Poisson hyperplane processes (PHP) and two‑layer ReLU neural networks. We show that the PHP with a Gaussian prior is an alternative probabilistic representation to a two‑layer ReLU neural network. In addition, we show that a two‑layer neural network constructed by PHP is scalable to large‑scale problems via the decomposition propositions. Finally, we propose an annealed sequential Monte Carlo algorithm for Bayesian inference. Our numerical experiments demonstrate that our proposed method outperforms the classic two‑layer ReLU neural network. The implementation of our proposed model is available at https://github.com/ShufeiGe/Pois_Relu.git.

Authors:Marko Sterbentz, Kevin Cushing, Cameron Barrie, Kristian J. Hammond
Title: RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models
Abstract:
Recent advances in text‑to‑SQL systems have been driven by larger models and improved datasets, yet progress is still limited by the scarcity of high‑quality training data. Manual data creation is expensive, and existing synthetic methods trade off reliability and scalability. Template‑based approaches ensure correct SQL but require schema‑specific templates, while LLM‑based generation scales easily but lacks quality and correctness guarantees. We introduce RingSQL, a hybrid data generation framework that combines schema‑independent query templates with LLM‑based paraphrasing of natural language questions. This approach preserves SQL correctness across diverse schemas while providing broad linguistic variety. In our experiments, we find that models trained using data produced by RingSQL achieve an average gain in accuracy of +2.3% across six text‑to‑SQL benchmarks when compared to models trained on other synthetic data. We make our code available at https://github.com/nu‑c3lab/RingSQL.

Authors:Yiqun T Chen, Sizhu Lu, Sijia Li, Moran Guo, Shengyi Li
Title: Efficient Inference for Noisy LLM-as-a-Judge Evaluation
Abstract:
Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM‑as‑a‑judge." In practice, LLM judges are imperfect predictions for the underlying truth and can exhibit systematic, non‑random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurementerror correction based on misclassification models such as Rogan‑Gladen‑style estimators, and (ii) surrogate‑outcome approaches such as prediction‑powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold‑standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)‑based efficient estimators and characterize conditions under which PPI‑style estimators attain strictly smaller asymptotic variance than measurement‑error corrections. We verify our theoretical results in simulations and demonstrate the methods on real‑data examples. We provide an implementation of the benchmarked methods and comparison utilities at https://github.com/yiqunchen/debias‑llm‑as‑a‑judge.

Authors:Qiao Liu, Wing Hung Wong
Title: A Bayesian Generative Modeling Approach for Arbitrary Conditional Inference
Abstract:
Modern data analysis increasingly requires flexible conditional inference P(X_B | X_A) where (X_A, X_B) is an arbitrary partition of observed variable X. Existing conditional inference methods lack this flexibility as they are tied to a fixed conditioning structure and cannot perform new conditional inference once trained. To solve this, we propose a Bayesian generative modeling (BGM) approach for arbitrary conditional inference without retraining. BGM learns a generative model of X through an iterative Bayesian updating algorithm where model parameters and latent variables are updated until convergence. Once trained, any conditional distribution can be obtained without retraining. Empirically, BGM achieves superior prediction performance with well calibrated predictive intervals, demonstrating that a single learned model can serve as a universal engine for conditional prediction with uncertainty quantification. We provide theoretical guarantees for the convergence of the stochastic iterative algorithm, statistical consistency and conditional‑risk bounds. The proposed BGM framework leverages the power of AI to capture complex relationships among variables while adhering to Bayesian principles, emerging as a promising framework for advancing various applications in modern data science. The code for BGM is freely available at https://github.com/liuq‑lab/bayesgm.

Authors:Susmit Das
Title: TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning
Abstract:
Reasoning oriented large language models often expose explicit "thinking" as long, turn‑global traces at the start of every response, either always on or toggled externally at inference time. While useful for arithmetic, programming, and problem solving, this design is costly, blurs claim level auditability, and cannot re‑trigger explicit reasoning once the model begins presenting. Dialogue models are also largely blind to temporal structure, treating replies after seconds and replies after weeks as equivalent unless time is stated in text. We introduce TIME, the Temporally Intelligent Meta‑reasoning Engine, a behavioral alignment framework that treats explicit reasoning as a context sensitive resource driven by discourse and temporal cues. TIME augments dialogue with optional ISO 8601 <time> tags, tick turns that represent silent gaps, and short <think> blocks that can appear anywhere in a reply. A four‑phase curriculum including a small, maximally diverse full‑batch alignment step trains Qwen3 dense models to invoke brief, in‑place reasoning bursts and keep user facing text compact. We evaluate with TIMEBench, a temporally grounded dialogue benchmark probing chronology, commonsense under gaps and offsets, anomaly detection, and continuity. Across 4B to 32B scales, TIME improves TIMEBench scores over base Qwen3 in both thinking and no‑thinking modes while reducing reasoning tokens by about an order of magnitude. Our training data and code are available at https://github.com/The‑Coherence‑Initiative/TIME and TIMEBench is available at https://github.com/The‑Coherence‑Initiative/TIMEBench

Authors:Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu
Title: FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts
Abstract:
Spatial‑Temporal Graph (STG) forecasting on large‑scale networks has garnered significant attention. However, existing models predominantly focus on short‑horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long‑horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity‑aware Mixture‑of‑Experts (MoEs) for long‑horizon and large‑scale STG forecasting, which unlocks one‑week‑ahead (672 steps at a 15‑minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self‑attention modules when applied to large‑scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed‑forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real‑world datasets demonstrate that FaST not only delivers superior long‑horizon predictive accuracy but also achieves remarkable computational efficiency compared to state‑of‑the‑art baselines. Our source code is available at: https://github.com/yijizhao/FaST.

Authors:Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia
Title: CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters
Abstract:
As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose \textscCuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic‑aware routing, \textscCuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that \textscCuMA achieves state‑of‑the‑art performance, significantly outperforming both dense baselines and semantic‑only MoEs. Crucially, our analysis confirms that \textscCuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.

Authors:Matan Kleiner, Lior Michaeli, Tomer Michaeli
Title: Illumination Angular Spectrum Encoding for Controlling the Functionality of Diffractive Networks
Abstract:
Diffractive neural networks have recently emerged as a promising framework for all‑optical computing. However, these networks are typically trained for a single task, limiting their potential adoption in systems requiring multiple functionalities. Existing approaches to achieving multi‑task functionality either modify the mechanical configuration of the network per task or use a different illumination wavelength or polarization state for each task. In this work, we propose a new control mechanism, which is based on the illumination's angular spectrum. Specifically, we shape the illumination using an amplitude mask that selectively controls its angular spectrum. We employ different illumination masks for achieving different network functionalities, so that the mask serves as a unique task encoder. Interestingly, we show that effective control can be achieved over a very narrow angular range, within the paraxial regime. We numerically illustrate the proposed approach by training a single diffractive network to perform multiple image‑to‑image translation tasks. In particular, we demonstrate translating handwritten digits into typeset digits of different values, and translating handwritten English letters into typeset numbers and typeset Greek letters, where the type of the output is determined by the illumination's angular components. As we show, the proposed framework can work under different coherence conditions, and can be combined with existing control strategies, such as different wavelengths. Our results establish the illumination angular spectrum as a powerful degree of freedom for controlling diffractive networks, enabling a scalable and versatile framework for multi‑task all‑optical computing.

Authors:Lei Xu, Shanshan Wang, Chenglong Xiao
Title: MPM-LLM4DSE: Reaching the Pareto Frontier in HLS with Multimodal Learning and LLM-Driven Exploration
Abstract:
High‑Level Synthesis (HLS) design space exploration (DSE) seeks Pareto‑optimal designs within expansive pragma configuration spaces. To accelerate HLS DSE, graph neural networks (GNNs) are commonly employed as surrogates for HLS tools to predict quality of results (QoR) metrics, while multi‑objective optimization algorithms expedite the exploration. However, GNN‑based prediction methods may not fully capture the rich semantic features inherent in behavioral descriptions, and conventional multi‑objective optimization algorithms often do not explicitly account for the domain‑specific knowledge regarding how pragma directives influence QoR. To address these limitations, this paper proposes the MPM‑LLM4DSE framework, which incorporates a multimodal prediction model (MPM) that simultaneously fuses features from behavioral descriptions and control and data flow graphs. Furthermore, the framework employs a large language model (LLM) as an optimizer, accompanied by a tailored prompt engineering methodology. This methodology incorporates pragma impact analysis on QoR to guide the LLM in generating high‑quality configurations (LLM4DSE). Experimental results demonstrate that our multimodal predictive model significantly outperforms state‑of‑the‑art work ProgSG by up to 10.25×. Furthermore, in DSE tasks, the proposed LLM4DSE achieves an average performance gain of 39.90% over prior methods, validating the effectiveness of our prompting methodology. Code and models are available at https://github.com/wslcccc/MPM‑LLM4DSE.

Authors:Marios Thoma, Vassilis Vassiliades, Loizos Michael
Title: Neural-Symbolic Integration with Evolvable Policies
Abstract:
Neural‑Symbolic (NeSy) Artificial Intelligence has emerged as a promising approach for combining the learning capabilities of neural networks with the interpretable reasoning of symbolic systems. However, existing NeSy frameworks typically require either predefined symbolic policies or policies that are differentiable, limiting their applicability when domain expertise is unavailable or when policies are inherently non‑differentiable. We propose a framework that addresses this limitation by enabling the concurrent learning of both non‑differentiable symbolic policies and neural network weights through an evolutionary process. Our approach casts NeSy systems as organisms in a population that evolve through mutations (both symbolic rule additions and neural weight changes), with fitness‑based selection guiding convergence toward hidden target policies. The framework extends the NEUROLOG architecture to make symbolic policies trainable, adapts Valiant's Evolvability framework to the NeSy context, and employs Machine Coaching semantics for mutable symbolic representations. Neural networks are trained through abductive reasoning from the symbolic component, eliminating differentiability requirements. Through extensive experimentation, we demonstrate that NeSy systems starting with empty policies and random neural weights can successfully approximate hidden non‑differentiable target policies, achieving median correct performance approaching 100%. This work represents a step toward enabling NeSy research in domains where the acquisition of symbolic knowledge from experts is challenging or infeasible.

Authors:Quang-Tu Pham, Hoang-Dieu Vu, Dinh-Dat Pham, Hieu H. Pham
Title: FedKDX: Federated Learning with Negative Knowledge Distillation for Enhanced Healthcare AI Systems
Abstract:
This paper introduces FedKDX, a federated learning framework that addresses limitations in healthcare AI through Negative Knowledge Distillation (NKD). Unlike existing approaches that focus solely on positive knowledge transfer, FedKDX captures both target and non‑target information to improve model generalization in healthcare applications. The framework integrates multiple knowledge transfer techniques‑‑including traditional knowledge distillation, contrastive learning, and NKD‑‑within a unified architecture that maintains privacy while reducing communication costs. Through experiments on healthcare datasets (SLEEP, UCI‑HAR, and PAMAP2), FedKDX demonstrates improved accuracy (up to 2.53% over state‑of‑the‑art methods), faster convergence, and better performance on non‑IID data distributions. Theoretical analysis supports NKD's contribution to addressing statistical heterogeneity in distributed healthcare data. The approach shows promise for privacy‑sensitive medical applications under regulatory frameworks like HIPAA and GDPR, offering a balanced solution between performance and practical implementation requirements in decentralized healthcare settings. The code and model are available at https://github.com/phamdinhdat‑ai/Fed_2024.

Authors:Paul Pu Liang
Title: A Vision for Multisensory Intelligence: Sensing, Science, and Synergy
Abstract:
Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross‑modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see https://mit‑mi.github.io/.

Authors:Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao
Title: Not All Steps are Informative: On the Linearity of LLMs' RLVR Training
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post‑training. Unlike supervised fine‑tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log‑probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on mathematics and code benchmarks by extrapolating beyond the step range where RL training remains stable. Our code is available at https://github.com/Miaow‑Lab/RLVR‑Linearity

Authors:Iaroslav Chelombitko, Ekaterina Chelombitko, Aleksey Komissarov
Title: SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
Abstract:
The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus‑free toolkit for morphological lexicon creation using MDL‑inspired Self‑Referential Atomicity Scoring, which filters composite forms through internal structural cues ‑ suited for low‑resource settings. Using the high‑purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k‑256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade‑off between morpheme coverage and over‑splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP

Authors:Mustapha Hamdi, Mourad Jabou
Title: Green MLOps: Closed-Loop, Energy-Aware Inference with NVIDIA Triton, FastAPI, and Bio-Inspired Thresholding
Abstract:
Energy efficiency is a first‑order concern in AI deployment, as long‑running inference can exceed training in cumulative carbon impact. We propose a bio‑inspired framework that maps protein‑folding energy basins to inference cost landscapes and controls execution via a decaying, closed‑loop threshold. A request is admitted only when the expected utility‑to‑energy trade‑off is favorable (high confidence/utility at low marginal energy and congestion), biasing operation toward the first acceptable local basin rather than pursuing costly global minima. We evaluate DistilBERT and ResNet‑18 served through FastAPI with ONNX Runtime and NVIDIA Triton on an RTX 4000 Ada GPU. Our ablation study reveals that the bio‑controller reduces processing time by 42% compared to standard open‑loop execution (0.50s vs 0.29s on A100 test set), with a minimal accuracy degradation (<0.5%). Furthermore, we establish the efficiency boundaries between lightweight local serving (ORT) and managed batching (Triton). The results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed‑loop energy‑aware inference in production.

Authors:Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen
Title: FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
Abstract:
We present FronTalk, a benchmark for front‑end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi‑modal feedback. In front‑end development, visual artifacts such as sketches, mockups and annotated creenshots are essential for conveying design intent, yet their role in multi‑turn code generation remains largely unexplored. To address this gap, we focus on the front‑end development task and curate FronTalk, a collection of 100 multi‑turn dialogues derived from real‑world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent‑based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under‑explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open‑source vision‑language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front‑end development and the general interaction dynamics of multi‑turn, multi‑modal code generation. Code and data are released at https://github.com/shirley‑wu/frontalk

Authors:Pietro de Oliveira Esteves
Title: Robust Physics Discovery from Highly Corrupted Data: A PINN Framework Applied to the Nonlinear Schrödinger Equation
Abstract:
We demonstrate a deep learning framework capable of recovering physical parameters from the Nonlinear Schrodinger Equation (NLSE) under severe noise conditions. By integrating Physics‑Informed Neural Networks (PINNs) with automatic differentiation, we achieve reconstruction of the nonlinear coefficient beta with less than 0.2 percent relative error using only 500 sparse, randomly sampled data points corrupted by 20 percent additive Gaussian noise, a regime where traditional finite difference methods typically fail due to noise amplification in numerical derivatives. We validate the method's generalization capabilities across different physical regimes (beta between 0.5 and 2.0) and varying data availability (between 100 and 1000 training points), demonstrating consistent sub‑1 percent accuracy. Statistical analysis over multiple independent runs confirms robustness (standard deviation less than 0.15 percent for beta equals 1.0). The complete pipeline executes in approximately 80 minutes on modest cloud GPU resources (NVIDIA Tesla T4), making the approach accessible for widespread adoption. Our results indicate that physics‑based regularization acts as an effective filter against high measurement uncertainty, positioning PINNs as a viable alternative to traditional optimization methods for inverse problems in spatiotemporal dynamics where experimental data is scarce and noisy. All code is made publicly available to facilitate reproducibility.

Authors:Chi Liu, Xin Chen
Title: Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training
Abstract:
Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that it is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly enhanced to improve both flexibility and generalization. To this end, we propose Adaptive‑Boundary‑Clipping GRPO (ABC‑GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC‑GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using the Qwen3 LLMs. Moreover, ABC‑GRPO maintains substantially higher entropy throughout training, thereby preserving the model's exploration capacity and mitigating premature convergence. The implementation code is available online to ease reproducibility https://github.com/chi2liu/ABC‑GRPO.

Authors:Benedikt Mayrhofer, Franz Pernkopf, Philipp Aichinger, Martin Hagmüller
Title: Lightweight and perceptually-guided voice conversion for electro-laryngeal speech
Abstract:
Electro‑laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, reducing naturalness and intelligibility. We propose a lightweight adaptation of the state‑of‑the‑art StreamVC framework to this setting by removing pitch and energy modules and combining self‑supervised pretraining with supervised fine‑tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm their influence: the best model variant, based on WavLM features and human‑feedback predictions (+WavLM+HF), drastically reduces character error rate (CER) of EL inputs, raises naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground‑truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion architectures to EL voice rehabilitation while also identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.

Authors:Jan Tagscherer, Sarah de Boer, Lena Philipp, Fennie van der Graaf, Dré Peeters, Joeran Bosma, Lars Leijten, Bogdan Obreja, Ewoud Smit, Alessa Hering
Title: EvalBlocks: A Modular Pipeline for Rapidly Evaluating Foundation Models in Medical Imaging
Abstract:
Developing foundation models in medical imaging requires continuous monitoring of downstream performance. Researchers are burdened with tracking numerous experiments, design choices, and their effects on performance, often relying on ad‑hoc, manual workflows that are inherently slow and error‑prone. We introduce EvalBlocks, a modular, plug‑and‑play framework for efficient evaluation of foundation models during development. Built on Snakemake, EvalBlocks supports seamless integration of new datasets, foundation models, aggregation methods, and evaluation strategies. All experiments and results are tracked centrally and are reproducible with a single command, while efficient caching and parallel execution enable scalable use on shared compute infrastructure. Demonstrated on five state‑of‑the‑art foundation models and three medical imaging classification tasks, EvalBlocks streamlines model evaluation, enabling researchers to iterate faster and focus on model innovation rather than evaluation logistics. The framework is released as open source software at https://github.com/DIAGNijmegen/eval‑blocks.

Authors:Arpad Berta, Gabor Danner, Istvan Hegedus, Mark Jelasity
Title: Detecting Semantic Backdoors in a Mystery Shopping Scenario
Abstract:
Detecting semantic backdoors in classification models‑‑where some classes can be activated by certain natural, but out‑of‑distribution inputs‑‑is an important problem that has received relatively little attention. Semantic backdoors are significantly harder to detect than backdoors that are based on trigger patterns due to the lack of such clearly identifiable patterns. We tackle this problem under the assumption that the clean training dataset and the training recipe of the model are both known. These assumptions are motivated by a consumer protection scenario, in which the responsible authority performs mystery shopping to test a machine learning service provider. In this scenario, the authority uses the provider's resources and tools to train a model on a given dataset and tests whether the provider included a backdoor. In our proposed approach, the authority creates a reference model pool by training a small number of clean and poisoned models using trusted infrastructure, and calibrates a model distance threshold to identify clean models. We propose and experimentally analyze a number of approaches to compute model distances and we also test a scenario where the provider performs an adaptive attack to avoid detection. The most reliable method is based on requesting adversarial training from the provider. The model distance is best measured using a set of input samples generated by inverting the models in such a way as to maximize the distance from clean samples. With these settings, our method can often completely separate clean and poisoned models, and it proves to be superior to state‑of‑the‑art backdoor detectors as well.

Authors:Sethupathy Parameswaran, Suresh Sundaram, Yuan Fang
Title: Prompt Tuning without Labeled Samples for Zero-Shot Node Classification in Text-Attributed Graphs
Abstract:
Node classification is a fundamental problem in information retrieval with many real‑world applications, such as community detection in social networks, grouping articles published online and product categorization in e‑commerce. Zero‑shot node classification in text‑attributed graphs (TAGs) presents a significant challenge, particularly due to the absence of labeled data. In this paper, we propose a novel Zero‑shot Prompt Tuning (ZPT) framework to address this problem by leveraging a Universal Bimodal Conditional Generator (UBCG). Our approach begins with pre‑training a graph‑language model to capture both the graph structure and the associated textual descriptions of each node. Following this, a conditional generative model is trained to learn the joint distribution of nodes in both graph and text modalities, enabling the generation of synthetic samples for each class based solely on the class name. These synthetic node and text embeddings are subsequently used to perform continuous prompt tuning, facilitating effective node classification in a zero‑shot setting. Furthermore, we conduct extensive experiments on multiple benchmark datasets, demonstrating that our framework performs better than existing state‑of‑the‑art baselines. We also provide ablation studies to validate the contribution of the bimodal generator. The code is provided at: https://github.com/Sethup123/ZPT.

Authors:Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li
Title: R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification
Abstract:
Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: Trajectory‑level rewards penalize valid prefixes for later errors, and failure‑dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R^3L, Reflect‑then‑Retry Reinforcement Learning with Language‑Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high‑quality trajectories, R^3L shifts from stochastic sampling to active synthesis via reflect‑then‑retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect‑then‑retry produces off‑policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.

Authors:Wajid Arshad Abbasi, Syed Ali Abbas, Maryum Bibi, Saiqa Andleeb, Muhammad Naveed Akhtar
Title: Investigating Knowledge Distillation Through Neural Networks for Protein Binding Affinity Prediction
Abstract:
The trade‑off between predictive accuracy and data availability makes it difficult to predict protein‑‑protein binding affinity accurately. The lack of experimentally resolved protein structures limits the performance of structure‑based machine learning models, which generally outperform sequence‑based methods. In order to overcome this constraint, we suggest a regression framework based on knowledge distillation that uses protein structural data during training and only needs sequence data during inference. The suggested method uses binding affinity labels and intermediate feature representations to jointly supervise the training of a sequence‑based student network under the guidance of a structure‑informed teacher network. Leave‑One‑Complex‑Out (LOCO) cross‑validation was used to assess the framework on a non‑redundant protein‑‑protein binding affinity benchmark dataset. A maximum Pearson correlation coefficient (P_r) of 0.375 and an RMSE of 2.712 kcal/mol were obtained by sequence‑only baseline models, whereas a P_r of 0.512 and an RMSE of 2.445 kcal/mol were obtained by structure‑based models. With a P_r of 0.481 and an RMSE of 2.488 kcal/mol, the distillation‑based student model greatly enhanced sequence‑only performance. Improved agreement and decreased bias were further confirmed by thorough error analyses. With the potential to close the performance gap between sequence‑based and structure‑based models as larger datasets become available, these findings show that knowledge distillation is an efficient method for transferring structural knowledge to sequence‑based predictors. The source code for running inference with the proposed distillation‑based binding affinity predictor can be accessed at https://github.com/wajidarshad/ProteinAffinityKD.

Authors:Joshua Salako
Title: Latent Geometry of Taste: Scalable Low-Rank Matrix Factorization for Recommender Systems
Abstract:
Scalability and data sparsity remain critical bottlenecks for collaborative filtering on massive interaction datasets. This work investigates the latent geometry of user preferences using the MovieLens 32M dataset, implementing a high‑performance, parallelized Alternating Least Squares (ALS) framework. Through extensive hyperparameter optimization, we demonstrate that constrained low‑rank models significantly outperform higher dimensional counterparts in generalization, achieving an optimal balance between Root Mean Square Error (RMSE) and ranking precision. We visualize the learned embedding space to reveal the unsupervised emergence of semantic genre clusters, confirming that the model captures deep structural relationships solely from interaction data. Finally, we validate the system's practical utility in a cold‑start scenario, introducing a tunable scoring parameter to manage the trade‑off between popularity bias and personalized affinity effectively. The codebase for this research can be found here: https://github.com/joshsalako/recommender.git

Authors:Carles Balsells-Rodas, Toshiko Matsui, Pedro A. M. Mediano, Yixin Wang, Yingzhen Li
Title: On the Identifiability of Regime-Switching Models with Multi-Lag Dependencies
Abstract:
Identifiability is central to the interpretability of deep latent variable models, ensuring parameterisations are uniquely determined by the data‑generating distribution. However, it remains underexplored for deep regime‑switching time series. We develop a general theoretical framework for multi‑lag Regime‑Switching Models (RSMs), encompassing Markov Switching Models (MSMs) and Switching Dynamical Systems (SDSs). For MSMs, we formulate the model as a temporally structured finite mixture and prove identifiability of both the number of regimes and the multi‑lag transitions in a nonlinear‑Gaussian setting. For SDSs, we establish identifiability of the latent variables up to permutation and scaling via temporal structure, which in turn yields conditions for identifiability of regime‑dependent latent causal graphs (up to regime/node permutations). Our results hold in a fully unsupervised setting through architectural and noise assumptions that are directly enforceable via neural network design. We complement the theory with a flexible variational estimator that satisfies the assumptions and validate the results on synthetic benchmarks. Across real‑world datasets from neuroscience, finance, and climate, identifiability leads to more trustworthy interpretability analysis, which is crucial for scientific discovery.

Authors:Bugra Kilictas, Faruk Alpay
Title: Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64
Abstract:
The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall" the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high‑level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand‑tuned NEON SIMD kernels, we achieve a form of "Software‑Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero‑copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general‑purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.

Authors:Dhruv Trehan, Paras Chopra
Title: Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts
Abstract:
We report a case study of four end‑to‑end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi‑AI review. From these attempts, we document six recurring failure modes: bias toward training data defaults, implementation drift under execution pressure, memory and context degradation across long‑horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI‑scientist systems, implications for autonomous scientific discovery, and we release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai‑scientist‑artefacts‑v1

Authors:Scott Thornton
Title: TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering
Abstract:
Large language models remain vulnerable to jailbreak attacks, and single‑layer defenses often trade security for usability. We present TRYLOCK, the first defense‑in‑depth architecture that combines four heterogeneous mechanisms across the inference stack: weight‑level safety alignment via DPO, activation‑level control via Representation Engineering (RepE) steering, adaptive steering strength selected by a lightweight sidecar classifier, and input canonicalization to neutralize encoding‑based bypasses. On Mistral‑7B‑Instruct evaluated against a 249‑prompt attack set spanning five attack families, TRYLOCK achieves 88.0% relative ASR reduction (46.5% to 5.6%), with each layer contributing unique coverage: RepE blocks 36% of attacks that bypass DPO alone, while canonicalization catches 14% of encoding attacks that evade both. We discover a non‑monotonic steering phenomenon ‑‑ intermediate strength (alpha=1.0) degrades safety below baseline ‑‑ and provide mechanistic hypotheses explaining RepE‑DPO interference. The adaptive sidecar reduces over‑refusal from 60% to 48% while maintaining identical attack defense, demonstrating that security and usability need not be mutually exclusive. We release all components ‑‑ trained adapters, steering vectors, sidecar classifier, preference pairs, and complete evaluation methodology ‑‑ enabling full reproducibility.

Authors:Gabriel Benedict, Matthew Butler, Naved Merchant, Eetu Salama-Laine
Title: WRAVAL -- WRiting Assist eVALuation
Abstract:
The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem‑solving tasks as measures of general intelligence. Small Language Models (SLMs) ‑‑ defined here as models under 10B parameters ‑‑ typically score 3‑4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs' effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs' capabilities in non‑reasoning tasks where predefined evaluation datasets don't exist. Our framework combines novel approaches in data generation, prompt‑tuning, and LLM‑based evaluation to demonstrate the potential of task‑specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: https://github.com/amazon‑science/wraval.

Authors:Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert
Title: AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
Abstract:
Multimodal medical large language models have shown impressive progress in chest X‑ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model explicitly designed for anatomically grounded chest X‑ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://github.com/aneesurhashmi/anatomix

Authors:Joseph Kampeas, Emir Haleva
Title: Joint Encoding of KV-Cache Blocks for Scalable LLM Serving
Abstract:
Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory‑heavy growth of key‑value (KV) caches, which limits real‑time throughput under concurrent loads. Existing KV‑cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV‑cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving standard cache structure. This alleviates the KV‑cache memory bottleneck, supporting high‑concurrency serving without specialized hardware. Theoretically, we analyze the rate‑distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to 4.38 × KV‑cache compression with negligible accuracy loss across diverse LLMs and benchmarks, outperforming recent structured and adaptive compression baselines. In real LLM serving, joint encoding improves the token throughput by ~40% on a single‑machine vLLM benchmark, demonstrating substantial gains in inference throughput. Code is available at https://github.com/sef1/kv_fast_fusion kv_joint_encoding.

Authors:Vidhi Rathore, Sambu Aneesh, Himanshu Singh
Title: Temporal Graph Network: Hallucination Detection in Multi-Turn Conversation
Abstract:
Hallucinations can be produced by conversational AI systems, particularly in multi‑turn conversations where context changes and contradictions may eventually surface. By representing the entire conversation as a temporal graph, we present a novel graph‑based method for detecting dialogue‑level hallucinations. Our framework models each dialogue as a node, encoding it using a sentence transformer. We explore two different ways of connectivity: i) shared‑entity edges, which connect turns that refer to the same entities; ii) temporal edges, which connect contiguous turns in the conversation. Message‑passing is used to update the node embeddings, allowing flow of information between related nodes. The context‑aware node embeddings are then combined using attention pooling into a single vector, which is then passed on to a classifier to determine the presence and type of hallucinations. We demonstrate that our method offers slightly improved performance over existing methods. Further, we show the attention mechanism can be used to justify the decision making process. The code and model weights are made available at: https://github.com/sambuaneesh/anlp‑project.

Authors:Youngjoon Jeong, Junha Chun, Taesup Kim
Title: Learning to Act Robustly with View-Invariant Latent Actions
Abstract:
Vision‑based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view‑invariant visual representations. This challenge becomes more pronounced in real‑world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi‑view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View‑Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view‑invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action‑guided objective based on ground‑truth action sequences. Experiments in both simulation and the real world show that VILA‑based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.

Authors:Arjun S. Nair
Title: Chronicals: A High-Performance Framework for LLM Fine-Tuning with 3.51x Speedup over Unsloth
Abstract:
Large language model fine‑tuning is bottlenecked by memory: a 7B parameter model requires 84GB‑‑14GB for weights, 14GB for gradients, and 56GB for FP32 optimizer states‑‑exceeding even A100‑40GB capacity. We present Chronicals, an open‑source training framework achieving 3.51x speedup over Unsloth through four synergistic optimizations: (1) fused Triton kernels eliminating 75% of memory traffic via RMSNorm (7x), SwiGLU (5x), and QK‑RoPE (2.3x) fusion; (2) Cut Cross‑Entropy reducing logit memory from 5GB to 135MB through online softmax computation; (3) LoRA+ with theoretically‑derived 16x differential learning rates between adapter matrices; and (4) Best‑Fit Decreasing sequence packing recovering 60‑75% of compute wasted on padding. On Qwen2.5‑0.5B with A100‑40GB, Chronicals achieves 41,184 tokens/second for full fine‑tuning versus Unsloth's 11,736 tokens/second (3.51x). For LoRA at rank 32, we reach 11,699 tokens/second versus Unsloth MAX's 2,857 tokens/second (4.10x). Critically, we discovered that Unsloth's reported 46,000 tokens/second benchmark exhibited zero gradient norms‑‑the model was not training. We provide complete mathematical foundations: online softmax correctness proofs, FlashAttention IO complexity bounds O(N^2 d^2 M^‑1), LoRA+ learning rate derivations from gradient magnitude analysis, and bin‑packing approximation guarantees. All implementations, benchmarks, and proofs are available at https://github.com/Ajwebdevs/Chronicals with pip installation via https://pypi.org/project/chronicals/.

Authors:Fabio Cumbo, Kabir Dhillon, Daniel Blankenberg
Title: hdlib 2.0: Extending Machine Learning Capabilities of Vector-Symbolic Architectures
Abstract:
Following the initial publication of hdlib, a Python library for designing Vector‑Symbolic Architectures (VSA), we introduce a major extension that significantly enhances its machine learning capabilities. VSA, also known as Hyperdimensional Computing, is a computing paradigm that represents and processes information using high‑dimensional vectors. While the first version of hdlib established a robust foundation for creating and manipulating these vectors, this update addresses the growing need for more advanced, data‑driven modeling within the VSA framework. Here, we present four extensions: significant enhancements to the existing supervised classification model also enabling feature selection, and a new regression model for predicting continuous variables, a clustering model for unsupervised learning, and a graph‑based learning model. Furthermore, we propose the first implementation ever of Quantum Hyperdimensional Computing with quantum‑powered arithmetic operations and a new Quantum Machine Learning model for supervised learning. hdlib remains open‑source and available on GitHub at https://github.com/cumbof/hdlib under the MIT license, and distributed through the Python Package Index (pip install hdlib) and Conda (conda install c conda‑forge hdlib). Documentation and examples of these new features are available on the official Wiki at https://github.com/cumbof/hdlib/wiki.

Authors:Subhankar Mishra
Title: mHC-GNN: Manifold-Constrained Hyper-Connections for Graph Neural Networks
Abstract:
Graph Neural Networks (GNNs) suffer from over‑smoothing in deep architectures and expressiveness bounded by the 1‑Weisfeiler‑Leman (1‑WL) test. We adapt Manifold‑Constrained Hyper‑Connections (\mhc)~\citepxie2025mhc, recently proposed for Transformers, to graph neural networks. Our method, mHC‑GNN, expands node representations across n parallel streams and constrains stream‑mixing matrices to the Birkhoff polytope via Sinkhorn‑Knopp normalization. We prove that mHC‑GNN exhibits exponentially slower over‑smoothing (rate (1‑γ)^L/n vs.\ (1‑γ)^L) and can distinguish graphs beyond 1‑WL. Experiments on 10 datasets with 4 GNN architectures show consistent improvements. Depth experiments from 2 to 128 layers reveal that standard GNNs collapse to near‑random performance beyond 16 layers, while mHC‑GNN maintains over 74% accuracy even at 128 layers, with improvements exceeding 50 percentage points at extreme depths. Ablations confirm that the manifold constraint is essential: removing it causes up to 82% performance degradation. Code is available at \hrefhttps://github.com/smlab‑niser/mhc‑gnnhttps://github.com/smlab‑niser/mhc‑gnn

Authors:Buqing Cao, Qian Peng, Xiang Xie, Liang Chen, Min Shi, Jianxun Liu
Title: Spiking Heterogeneous Graph Attention Networks
Abstract:
Real‑world graphs or networks are usually heterogeneous, involving multiple types of nodes and relationships. Heterogeneous graph neural networks (HGNNs) can effectively handle these diverse nodes and edges, capturing heterogeneous information within the graph, thus exhibiting outstanding performance. However, most methods of HGNNs usually involve complex structural designs, leading to problems such as high memory usage, long inference time, and extensive consumption of computing resources. These limitations pose certain challenges for the practical application of HGNNs, especially for resource‑constrained devices. To mitigate this issue, we propose the Spiking Heterogeneous Graph Attention Networks (SpikingHAN), which incorporates the brain‑inspired and energy‑saving properties of Spiking Neural Networks (SNNs) into heterogeneous graph learning to reduce the computing cost without compromising the performance. Specifically, SpikingHAN aggregates metapath‑based neighbor information using a single‑layer graph convolution with shared parameters. It then employs a semantic‑level attention mechanism to capture the importance of different meta‑paths and performs semantic aggregation. Finally, it encodes the heterogeneous information into a spike sequence through SNNs, simulating bioinformatic processing to derive a binarized 1‑bit representation of the heterogeneous graph. Comprehensive experimental results from three real‑world heterogeneous graph datasets show that SpikingHAN delivers competitive node classification performance. It achieves this with fewer parameters, quicker inference, reduced memory usage, and lower energy consumption. Code is available at https://github.com/QianPeng369/SpikingHAN.

Authors:Ahmad Makinde
Title: Temporal Kolmogorov-Arnold Networks (T-KAN) for High-Frequency Limit Order Book Forecasting: Efficiency, Interpretability, and Alpha Decay
Abstract:
High‑Frequency trading (HFT) environments are characterised by large volumes of limit order book (LOB) data, which is notoriously noisy and non‑linear. Alpha decay represents a significant challenge, with traditional models such as DeepLOB losing predictive power as the time horizon (k) increases. In this paper, using data from the FI‑2010 dataset, we introduce Temporal Kolmogorov‑Arnold Networks (T‑KAN) to replace the fixed, linear weights of standard LSTMs with learnable B‑spline activation functions. This allows the model to learn the 'shape' of market signals as opposed to just their magnitude. This resulted in a 19.1% relative improvement in the F1‑score at the k = 100 horizon. The efficacy of T‑KAN networks cannot be understated, producing a 132.48% return compared to the ‑82.76% DeepLOB drawdown under 1.0 bps transaction costs. In addition to this, the T‑KAN model proves quite interpretable, with the 'dead‑zones' being clearly visible in the splines. The T‑KAN architecture is also uniquely optimized for low‑latency FPGA implementation via High level Synthesis (HLS). The code for the experiments in this project can be found at https://github.com/AhmadMak/Temporal‑Kolmogorov‑Arnold‑Networks‑T‑KAN‑for‑High‑Frequency‑Limit‑Order‑Book‑Forecasting.

Authors:Salim Khazem
Title: TopoLoRA-SAM: Topology-Aware Parameter-Efficient Adaptation of Foundation Segmenters for Thin-Structure and Cross-Domain Binary Semantic Segmentation
Abstract:
Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero‑shot generalization through large‑scale pretraining, but adapting them to domain‑specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine‑tuning is computationally expensive and risks catastrophic forgetting. We propose TopoLoRA‑SAM, a topology‑aware and parameter‑efficient adaptation framework for binary semantic segmentation. TopoLoRA‑SAM injects Low‑Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology‑aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE\_DB1), polyp segmentation (Kvasir‑SEG), and SAR sea/land segmentation (SL‑SSDD), comparing against U‑Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA‑SAM achieves the best retina‑average Dice and the best overall average Dice across datasets, while training only 5.2% of model parameters (~4.9M). On the challenging CHASE\_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology‑aware parameter‑efficient adaptation can match or exceed fully fine‑tuned specialist models. Code is available at : https://github.com/salimkhazem/Seglab.git

Authors:Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia
Title: VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Abstract:
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early‑stage generation; 2) a dynamic time‑step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.

Authors:Almaz Ermilov
Title: FormationEval, an open multiple-choice benchmark for petroleum geoscience
Abstract:
This paper presents FormationEval, an open multiple‑choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept‑based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open‑weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open‑weight models, GLM‑4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open‑weight and closed models is narrower than expected, with several lower‑cost open‑weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging domain across all models, while smaller models show wider performance variance. Residual length bias in the dataset (correct answers tend to be longer) is documented along with bias mitigation strategies applied during construction. The benchmark, evaluation code and results are publicly available.

Authors:Zhuofan Shi, Hubao A, Yufei Shao, Dongliang Huang, Hongxu An, Chunxiao Xin, Haiyang Shen, Zhenyu Wang, Yunshan Na, Gang Huang, Xiang Jing
Title: MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics
Abstract:
Molecular dynamics (MD) simulations are essential for understanding atomic‑scale behaviors in materials science, yet writing LAMMPS scripts remains highly specialized and time‑consuming tasks. Although LLMs show promise in code generation and domain‑specific question answering, their performance in MD scenarios is limited by scarce domain data, the high deployment cost of state‑of‑the‑art LLMs, and low code executability. Building upon our prior MDAgent, we present MDAgent2, the first end‑to‑end framework capable of performing both knowledge Q&A and code generation within the MD domain. We construct a domain‑specific data‑construction pipeline that yields three high‑quality datasets spanning MD knowledge, question answering, and code generation. Based on these datasets, we adopt a three stage post‑training strategy‑‑continued pre‑training (CPT), supervised fine‑tuning (SFT), and reinforcement learning (RL)‑‑to train two domain‑adapted models, MD‑Instruct and MD‑Code. Furthermore, we introduce MD‑GRPO, a closed‑loop RL method that leverages simulation outcomes as reward signals and recycles low‑reward trajectories for continual refinement. We further build MDAgent2‑RUNTIME, a deployable multi‑agent system that integrates code generation, execution, evaluation, and self‑correction. Together with MD‑EvalBench proposed in this work, the first benchmark for LAMMPS code generation and question answering, our models and system achieve performance surpassing several strong baselines.This work systematically demonstrates the adaptability and generalization capability of large language models in industrial simulation tasks, laying a methodological foundation for automatic code generation in AI for Science and industrial‑scale simulations. URL: https://github.com/FredericVAN/PKU_MDAgent2

Authors:Yanhai Gan, Yipeng Chen, Ning Li, Xingguo Liu, Junyu Dong, Xianyao Chen
Title: Explore the Ideology of Deep Learning in ENSO Forecasts
Abstract:
The El Ni~no‑Southern Oscillation (ENSO) exerts profound influence on global climate variability, yet its prediction remains a grand challenge. Recent advances in deep learning have significantly improved forecasting skill, but the opacity of these models hampers scientific trust and operational deployment. Here, we introduce a mathematically grounded interpretability framework based on bounded variation function. By rescuing the "dead" neurons from the saturation zone of the activation function, we enhance the model's expressive capacity. Our analysis reveals that ENSO predictability emerges dominantly from the tropical Pacific, with contributions from the Indian and Atlantic Oceans, consistent with physical understanding. Controlled experiments affirm the robustness of our method and its alignment with established predictors. Notably, we probe the persistent Spring Predictability Barrier (SPB), finding that despite expanded sensitivity during spring, predictive performance declines‑likely due to suboptimal variable selection. These results suggest that incorporating additional ocean‑atmosphere variables may help transcend SPB limitations and advance long‑range ENSO prediction.

Authors:Matthias Bartolo, Dylan Seychell, Gabriel Hili, Matthew Montebello, Carl James Debono, Saviour Formosa, Konstantinos Makantasis
Title: Enhancing Object Detection with Privileged Information: A Model-Agnostic Teacher-Student Approach
Abstract:
This paper investigates the integration of the Learning Using Privileged Information (LUPI) paradigm in object detection to exploit fine‑grained, descriptive information available during training but not at inference. We introduce a general, model‑agnostic methodology for injecting privileged information‑such as bounding box masks, saliency maps, and depth cues‑into deep learning‑based object detectors through a teacher‑student architecture. Experiments are conducted across five state‑of‑the‑art object detection models and multiple public benchmarks, including UAV‑based litter detection datasets and Pascal VOC 2012, to assess the impact on accuracy, generalization, and computational efficiency. Our results demonstrate that LUPI‑trained students consistently outperform their baseline counterparts, achieving significant boosts in detection accuracy with no increase in inference complexity or model size. Performance improvements are especially marked for medium and large objects, while ablation studies reveal that intermediate weighting of teacher guidance optimally balances learning from privileged and standard inputs. The findings affirm that the LUPI framework provides an effective and practical strategy for advancing object detection systems in both resource‑constrained and real‑world settings.

Authors:Peiyan Hu, Haodong Feng, Hongyuan Liu, Tongtong Yan, Wenhao Deng, Tianrun Gao, Rong Zheng, Haoren Zheng, Chenglei Yu, Chuanrui Wang, Kaiwen Li, Zhi-Ming Ma, Dezhi Zhou, Xingcai Lu, Dixia Fan, Tailin Wu
Title: RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data
Abstract:
Predicting the evolution of complex physical systems remains a central problem in science and engineering. Despite rapid progress in scientific Machine Learning (ML) models, a critical bottleneck is the lack of expensive real‑world data, resulting in most current models being trained and validated on simulated data. Beyond limiting the development and evaluation of scientific ML, this gap also hinders research into essential tasks such as sim‑to‑real transfer. We introduce RealPDEBench, the first benchmark for scientific ML that integrates real‑world measurements with paired numerical simulations. RealPDEBench consists of five datasets, three tasks, eight metrics, and ten baselines. We first present five real‑world measured datasets with paired simulated datasets across different complex physical systems. We further define three tasks, which allow comparisons between real‑world and simulated data, and facilitate the development of methods to bridge the two. Moreover, we design eight evaluation metrics, spanning data‑oriented and physics‑oriented metrics, and finally benchmark ten representative baselines, including state‑of‑the‑art models, pretrained PDE foundation models, and a traditional method. Experiments reveal significant discrepancies between simulated and real‑world data, while showing that pretraining with simulated data consistently improves both accuracy and convergence. In this work, we hope to provide insights from real‑world data, advancing scientific ML toward bridging the sim‑to‑real gap and real‑world deployment. Our benchmark, datasets, and instructions are available at https://realpdebench.github.io/.

Authors:Lakshay Sharma, Alex Marin
Title: Subimage Overlap Prediction: Task-Aligned Self-Supervised Pretraining For Semantic Segmentation In Remote Sensing Imagery
Abstract:
Self‑supervised learning (SSL) methods have become a dominant paradigm for creating general purpose models whose capabilities can be transferred to downstream supervised learning tasks. However, most such methods rely on vast amounts of pretraining data. This work introduces Subimage Overlap Prediction, a novel self‑supervised pretraining task to aid semantic segmentation in remote sensing imagery that uses significantly lesser pretraining imagery. Given an image, a sub‑image is extracted and the model is trained to produce a semantic mask of the location of the extracted sub‑image within the original image. We demonstrate that pretraining with this task results in significantly faster convergence, and equal or better performance (measured via mIoU) on downstream segmentation. This gap in convergence and performance widens when labeled training data is reduced. We show this across multiple architecture types, and with multiple downstream datasets. We also show that our method matches or exceeds performance while requiring significantly lesser pretraining data relative to other SSL methods. Code and model weights are provided at \hrefhttps://github.com/sharmalakshay93/subimage‑overlap‑predictiongithub.com/sharmalakshay93/subimage‑overlap‑prediction.

Authors:Siba Smarak Panigrahi, Jovana Videnović, Maria Brbić
Title: HeurekaBench: A Benchmarking Framework for AI Co-scientist
Abstract:
LLM‑based reasoning models have enabled the development of agentic systems that act as co‑scientists, assisting in multi‑step scientific analysis. However, evaluating these systems is challenging, as it requires realistic, end‑to‑end research scenarios that integrate data analysis, interpretation, and the generation of new insights from the experimental data. To address this limitation, we introduce HeurekaBench, a framework to create benchmarks with exploratory, open‑ended research questions for experimental datasets. Each such question is grounded in a scientific study and its corresponding code repository, and is created using a semi‑automated pipeline that leverages multiple LLMs to extract insights and generate candidate workflows, which are then verified against reported findings. We instantiate the framework in single‑cell biology to obtain sc‑HeurekaBench benchmark and use it to compare state‑of‑the‑art single‑cell agents. We further showcase the benefits of our benchmark for quantitatively analyzing current design choices in agentic systems. We find that the addition of a critic module can improve ill‑formed responses for open‑source LLM‑based agents by up to 22% and close the gap with their closed‑source counterparts. Overall, HeurekaBench sets a path toward rigorous, end‑to‑end evaluation of scientific agents, grounding benchmark construction in real scientific workflows.

Authors:Emiliya Khidirova, Oktay Karakuş
Title: UniCrop: A Universal, Multi-Source Data Engineering Pipeline for Scalable Crop Yield Prediction
Abstract:
Accurate crop yield prediction relies on diverse data streams, including satellite, meteorological, soil, and topographic information. However, despite rapid advances in machine learning, existing approaches remain crop‑ or region‑specific and require data engineering efforts. This limits scalability, reproducibility, and operational deployment. This study introduces UniCrop, a universal and reusable data pipeline designed to automate the acquisition, cleaning, harmonisation, and engineering of multi‑source environmental data for crop yield prediction. For any given location, crop type, and temporal window, UniCrop automatically retrieves, harmonises, and engineers over 200 environmental variables (Sentinel‑1/2, MODIS, ERA5‑Land, NASA POWER, SoilGrids, and SRTM), reducing them to a compact, analysis‑ready feature set utilising a structured feature reduction workflow with minimum redundancy maximum relevance (mRMR). To validate, UniCrop was applied to a rice yield dataset comprising 557 field observations. Using only the selected 15 features, four baseline machine learning models (LightGBM, Random Forest, Support Vector Regression, and Elastic Net) were trained. LightGBM achieved the best single‑model performance (RMSE = 465.1 kg/ha, R^2 = 0.6576), while a constrained ensemble of all baselines further improved accuracy (RMSE = 463.2 kg/ha, R^2 = 0.6604). UniCrop contributes a scalable and transparent data‑engineering framework that addresses the primary bottleneck in operational crop yield modelling: the preparation of consistent and harmonised multi‑source data. By decoupling data specification from implementation and supporting any crop, region, and time frame through simple configuration updates, UniCrop provides a practical foundation for scalable agricultural analytics. The code and implementation documentation are shared in https://github.com/CoDIS‑Lab/UniCrop.

Authors:Myung-Hwan Jang, Jeong-Min Park, Yunyong Ko, Sang-Wook Kim
Title: Accelerating Storage-Based Training for Graph Neural Networks
Abstract:
Graph neural networks (GNNs) have achieved breakthroughs in various real‑world downstream tasks due to their powerful expressiveness. As the scale of real‑world graphs has been continuously growing, a storage‑based approach to GNN training has been studied, which leverages external storage (e.g., NVMe SSDs) to handle such web‑scale graphs on a single machine. Although such storage‑based GNN training methods have shown promising potential in large‑scale GNN training, we observed that they suffer from a severe bottleneck in data preparation since they overlook a critical challenge: how to handle a large number of small storage I/Os. To address the challenge, in this paper, we propose a novel storage‑based GNN training framework, named AGNES, that employs a method of block‑wise storage I/O processing to fully utilize the I/O bandwidth of high‑performance storage devices. Moreover, to further enhance the efficiency of each storage I/O, AGNES employs a simple yet effective strategy, hyperbatch‑based processing based on the characteristics of real‑world graphs. Comprehensive experiments on five real‑world graphs reveal that AGNES consistently outperforms four state‑of‑the‑art methods, by up to 4.1X faster than the best competitor. Our code is available at https://github.com/Bigdasgit/agnes‑kdd26.

Authors:Wentao Bian, Fenglei Xu
Title: Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration
Abstract:
In this paper, we revisit multimodal few‑shot 3D point cloud semantic segmentation (FS‑PCS), identifying a conflict in "Fuse‑then‑Refine" paradigms: the "Plasticity‑Stability Dilemma." In addition, CLIP's inter‑class confusion can result in semantic blindness. To address these issues, we present the Decoupled‑experts Arbitration Few‑Shot SegNet (DA‑FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA‑FSS employs the same backbone and pre‑trained text encoder as MM‑FSS to generate text embeddings, which can increase free modalities' utilization rate and better leverage each modality's information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA‑FSS over MM‑FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA‑FSS.

Authors:Akshay Sasi, Malavika Pradeep, Nusaibah Farrukh, Rahul Venugopal, Elizabeth Sherly
Title: Unveiling the Heart-Brain Connection: An Analysis of ECG in Cognitive Performance
Abstract:
Understanding the interaction of neural and cardiac systems during cognitive activity is critical to advancing physiological computing. Although EEG has been the gold standard for assessing mental workload, its limited portability restricts its real‑world use. Widely available ECG through wearable devices proposes a pragmatic alternative. This research investigates whether ECG signals can reliably reflect cognitive load and serve as proxies for EEG‑based indicators. In this work, we present multimodal data acquired from two different paradigms involving working‑memory and passive‑listening tasks. For each modality, we extracted ECG time‑domain HRV metrics and Catch22 descriptors against EEG spectral and Catch22 features, respectively. We propose a cross‑modal XGBoost framework to project the ECG features onto EEG‑representative cognitive spaces, thereby allowing workload inferences using only ECG. Our results show that ECG‑derived projections expressively capture variation in cognitive states and provide good support for accurate classification. Our findings underpin ECG as an interpretable, real‑time, wearable solution for everyday cognitive monitoring.

Authors:Vladimer Khasia
Title: Spectral-Window Hybrid (SWH)
Abstract:
Scaling sequence modeling to extreme contexts requires balancing computational efficiency with representational expressivity. While Transformers provide precise retrieval via the attention mechanism, their quadratic \mathcalO(T^2) complexity limits their application to long‑horizon tasks. In this work, we propose the Spectral‑Window Hybrid (SWH), an architecture that decouples sequence modeling into two parallel streams: a global branch utilizing the Convolution Theorem to model long‑range decay dynamics in \mathcalO(T \log T) time, and a local branch employing sliding‑window attention for token interactions within a bounded context. By aggregating these representations, SWH avoids the computational bottleneck of global attention while retaining local precision. We demonstrate that SWH matches the perplexity of standard Transformers on short contexts while enabling efficient linear scaling to extended sequences. The code is available at https://github.com/VladimerKhasia/SWH

Authors:Keith Frankston, Benjamin Howard
Title: Accelerating Monte-Carlo Tree Search with Optimized Posterior Policies
Abstract:
We introduce a recursive AlphaZero‑style Monte‑‑Carlo tree search algorithm, "RMCTS". The advantage of RMCTS over AlphaZero's MCTS‑UCB is speed. In RMCTS, the search tree is explored in a breadth‑first manner, so that network inferences naturally occur in large batches. This significantly reduces the GPU latency cost. We find that RMCTS is often more than 40 times faster than MCTS‑UCB when searching a single root state, and about 3 times faster when searching a large batch of root states. The recursion in RMCTS is based on computing optimized posterior policies at each game state in the search tree, starting from the leaves and working back up to the root. Here we use the posterior policy explored in "Monte‑‑Carlo tree search as regularized policy optimization" (Grill, et al.) Their posterior policy is the unique policy which maximizes the expected reward given estimated action rewards minus a penalty for diverging from the prior policy. The tree explored by RMCTS is not defined in an adaptive manner, as it is in MCTS‑UCB. Instead, the RMCTS tree is defined by following prior network policies at each node. This is a disadvantage, but the speedup advantage is more significant, and in practice we find that RMCTS‑trained networks match the quality of MCTS‑UCB‑trained networks in roughly one‑third of the training time. We include timing and quality comparisons of RMCTS vs. MCTS‑UCB for three games: Connect‑4, Dots‑and‑Boxes, and Othello.

Authors:Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung
Title: Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models
Abstract:
Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within‑dataset co‑occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention‑weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic‑aware representations that complement the metric space of categorical data for accurate clustering. That is, LLM is adopted to describe attribute values for representation enhancement, and the LLM‑enhanced embeddings are combined with the original data to explore semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19‑27%. Code is available at https://github.com/develop‑yang/ARISE

Authors:Bryon Tjanaka, Henry Chen, Matthew C. Fontaine, Stefanos Nikolaidis
Title: Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces
Abstract:
Quality diversity (QD) optimization searches for a collection of solutions that optimize an objective while attaining diverse outputs of a user‑specified, vector‑valued measure function. Contemporary QD algorithms focus on low‑dimensional measures because high‑dimensional measures are prone to distortion, where many solutions found by the QD algorithm map to similar measures. For example, the CMA‑MAE algorithm guides measure space exploration with a histogram in measure space that records so‑called discount values. However, CMA‑MAE stagnates in domains with high‑dimensional measure spaces because solutions with similar measures fall into the same histogram cell and thus receive identical discount values. To address these limitations, we propose Discount Model Search (DMS), which guides exploration with a model that provides a smooth, continuous representation of discount values. In high‑dimensional measure spaces, this model enables DMS to distinguish between solutions with similar measures and thus continue exploration. We show that DMS facilitates new QD applications by introducing two domains where the measure space is the high‑dimensional space of images, which enables users to specify their desired measures by providing a dataset of images rather than hand‑designing the measure function. Results in these domains and on high‑dimensional benchmarks show that DMS outperforms CMA‑MAE and other black‑box QD algorithms.

Authors:Shiao Wang, Xiao Wang, Haonan Zhao, Jiarui Xu, Bo Jiang, Lin Zhu, Xin Zhao, Yonghong Tian, Jin Tang
Title: Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking
Abstract:
Existing RGB‑Event visual object tracking approaches primarily rely on conventional feature‑level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion‑sensitive nature of event cameras are often overlooked, while low‑information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high‑frequency information from the event modality. Specifically, RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High‑frequency event information is selectively fused into RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion‑guided spatial sparsification module leverages the motion‑sensitive nature of event cameras to capture the relationship between target motion cues and spatial probability distribution, filtering out low‑information regions and enhancing target‑relevant features. Finally, a sparse set of target‑relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB‑Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code of this paper will be released on https://github.com/Event‑AHU/OpenEvTracking

Authors:Subhankar Mishra
Title: Clean-GS: Semantic Mask-Guided Pruning for 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting produces high‑quality scene reconstructions but generates hundreds of thousands of spurious Gaussians (floaters) scattered throughout the environment. These artifacts obscure objects of interest and inflate model sizes, hindering deployment in bandwidth‑constrained applications. We present Clean‑GS, a method for removing background clutter and floaters from 3DGS reconstructions using sparse semantic masks. Our approach combines whitelist‑based spatial filtering with color‑guided validation and outlier removal to achieve 60‑80% model compression while preserving object quality. Unlike existing 3DGS pruning methods that rely on global importance metrics, Clean‑GS uses semantic information from as few as 3 segmentation masks (1% of views) to identify and remove Gaussians not belonging to the target object. Our multi‑stage approach consisting of (1) whitelist filtering via projection to masked regions, (2) depth‑buffered color validation, and (3) neighbor‑based outlier removal isolates monuments and objects from complex outdoor scenes. Experiments on Tanks and Temples show that Clean‑GS reduces file sizes from 125MB to 47MB while maintaining rendering quality, making 3DGS models practical for web deployment and AR/VR applications. Our code is available at https://github.com/smlab‑niser/clean‑gs

Authors:Gihyeon Sim
Title: When to Ponder: Adaptive Compute Allocation for Code Generation via Test-Time Training
Abstract:
Large language models apply uniform computation to all inputs, regardless of difficulty. We propose PonderTTT, a gating strategy using the TTT layer's self‑supervised reconstruction loss to selectively trigger Test‑Time Training (TTT) updates. The gating decision itself is training‑free‑‑requiring no learned classifier or auxiliary networks; only a single scalar threshold is initially calibrated on unlabeled data and continuously adapted via EMA to maintain target update rates. Our experiments with GPT‑2 models (124M to 1.5B) on code language modeling (The Stack v2, teacher‑forced perplexity) demonstrate that this signal is inference‑compatible, requiring no ground‑truth labels. Our Reconstruction Gating achieves 82‑89% Oracle Recovery while being fully training‑free, significantly outperforming Random Skip baselines (up to 16% lower loss on OOD languages).

Authors:Thomas Katraouras, Dimitrios Rafailidis
Title: Memory Bank Compression for Continual Adaptation of Large Language Models
Abstract:
Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine‑tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory‑augmented approaches address this by equipping LLMs with a memory bank, that is an external memory module which stores information for future use. However, these methods face a critical limitation, in particular, the memory bank constantly grows in the real‑world scenario when large‑scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key‑Value Low‑Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question‑answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.

Authors:Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
Title: Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Abstract:
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one‑way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real‑time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real‑time user‑avatar interactions through diffusion forcing. This design allows the avatar to process real‑time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non‑verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label‑free learning of expressive interaction. Experimental results demonstrate that our framework enables real‑time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.

Authors:Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan
Title: E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
Abstract:
Recent reinforcement learning has enhanced the flow matching models on human preference alignment. While stochastic sampling enables the exploration of denoising directions, existing methods which optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that the high entropy steps enable more efficient and effective exploration while the low entropy steps result in undistinguished roll‑outs. To this end, we propose E‑GRPO, an entropy aware Group Relative Policy Optimization to increase the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffer from ambiguous reward signals due to stochasticity from multiple steps, we specifically merge consecutive low entropy steps to formulate one high entropy step for SDE sampling, while applying ODE sampling on other steps. Building upon this, we introduce multi‑step group normalized advantage, which computes group‑relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results on different reward settings have demonstrated the effectiveness of our methods.

Authors:Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu
Title: Deep Delta Learning
Abstract:
The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data‑dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank‑1 perturbation of the identity matrix, parameterized by a reflection direction vector \mathbfk(\mathbfX) and a gating scalar β(\mathbfX). We provide a spectral analysis of this operator, demonstrating that the gate β(\mathbfX) enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank‑1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer‑wise transition operator, enabling the modeling of complex, non‑monotonic dynamics while preserving the stable training characteristics of gated residual architectures.

Authors:Aditya Sai Ellendula, Yi Wang, Minh Nguyen, Chandrajit Bajaj
Title: GRL-SNAM: Geometric Reinforcement Learning with Path Differential Hamiltonians for Simultaneous Navigation and Mapping in Unknown Environments
Abstract:
We present GRL‑SNAM, a geometric reinforcement learning framework for Simultaneous Navigation and Mapping(SNAM) in unknown environments. A SNAM problem is challenging as it needs to design hierarchical or joint policies of multiple agents that control the movement of a real‑life robot towards the goal in mapless environment, i.e. an environment where the map of the environment is not available apriori, and needs to be acquired through sensors. The sensors are invoked from the path learner, i.e. navigator, through active query responses to sensory agents, and along the motion path. GRL‑SNAM differs from preemptive navigation algorithms and other reinforcement learning methods by relying exclusively on local sensory observations without constructing a global map. Our approach formulates path navigation and mapping as a dynamic shortest path search and discovery process using controlled Hamiltonian optimization: sensory inputs are translated into local energy landscapes that encode reachability, obstacle barriers, and deformation constraints, while policies for sensing, planning, and reconfiguration evolve stagewise via updating Hamiltonians. A reduced Hamiltonian serves as an adaptive score function, updating kinetic/potential terms, embedding barrier constraints, and continuously refining trajectories as new local information arrives. We evaluate GRL‑SNAM on two different 2D navigation tasks. Comparing against local reactive baselines and global policy learning references under identical stagewise sensing constraints, it preserves clearance, generalizes to unseen layouts, and demonstrates that Geometric RL learning via updating Hamiltonians enables high‑quality navigation through minimal exploration via local energy refinement rather than extensive global mapping. The code is publicly available on \hrefhttps://github.com/CVC‑Lab/GRL‑SNAMGithub.